Cloud-based secure human action recognition with fully homomorphic encryption

Abstract

The growth of cloud computing has encouraged resource-constrained data owners to upload skeleton data to the cloud for action identification, but this practice increases the risk of privacy breaches. Although the traditional anonymized privacy protection method protects the user privacy, it sacrifices the performance of action recognition. To solve the above problems, a Convolutional Neural Network (CNN) architecture compatible with Fully Homomorphic Encryption (FHE) is proposed in this paper, which can achieve secure action recognition without sacrificing the accuracy of action recognition. In addition, to solve the problem of low computational efficiency of the Residue Number System (RNS) variant of CKKS (RNS-CKKS) applied to CNN networks, a parallel fully homomorphic convolution method is designed to improve computational efficiency. To reduce the overhead of rotating key generation and transmission, a multi-layer key generation system is constructed. Finally, the superiority of the proposed model is verified on real data sets.

Full text

Translate

Turn on search term navigation

Introduction

By combining distributed computing, grid computing, and utility computing, cloud computing has features such as strong versatility, large scale, scalability, high availability, and on-demand services, which can effectively improve resource utilization and reduce the number of devices and system energy consumption. Therefore, the application of cloud computing technology in the field of action recognition can improve the efficiency of action recognition, which is conducive to the monitoring of sudden human behaviors and timely alarms (such as violent behaviors and dangerous behaviors). However, uploading monitoring information directly to a cloud server for action recognition faces the risk of privacy leakage [1]. In order to protect the privacy of users during action recognition, traditional methods use methods such as reducing video resolution [2], identifying and removing privacy areas [3], or training adversarial frameworks to learn anonymous functions [4] to remove privacy features to protect privacy information. These methods not only face huge workloads but may also lead to a decrease in the performance of action recognition models. Action recognition methods based on skeleton data make up for the shortcomings of traditional methods. Compared with video data, skeleton data have the characteristics of high abstraction, low complexity, and good robustness, exposing fewer details in the human activity process, and can protect user privacy to a certain extent [5].

Nevertheless, skeleton data are not absolutely secure. A study [6] has shown that the accuracy of using skeleton datasets to train classifiers to identify personal identities has reached 80%. In other words, even if users upload skeleton data that they believe to be safe, attackers can still recover the correct personal identity information with a high probability, causing user privacy leakage. Therefore, it is necessary to take measures to protect skeleton data to prevent the leakage of personal privacy. Most existing methods for protecting the privacy of skeleton data use anonymization methods, such as adversarial learning methods [6] or virtual skeleton anonymous methods [7]. Although they can avoid the leakage of user privacy information to a certain extent, they require sacrificing the accuracy of action classification, resulting in a significant reduction in action recognition accuracy.

FHE is capable of performing multiple rounds of arithmetic operations on data in an encrypted state, so it is considered a promising solution for secure outsourced computing [8]. Using FHE to protect the skeleton data can ensure that the skeleton data will not be modified due to anonymous operation, and ensure the accuracy of subsequent action identification. However, the application of FHE to the action recognition process of skeleton data has such problems as low reasoning efficiency and high overhead [9]. Compared with plaintext computation, the computational cost of ciphertext computation after homomorphic encryption is greatly increased, especially the multiplication operation is more expensive than other operations. In addition, in the CNN network calculation under FHE encryption, the server needs to accept the rotating key from the key generator, which will bring large computing and communication overhead. Because of the complexity of CNN networks, it is necessary to perform several cyclic shift operations to perform homomorphic calculations. Therefore, a large number of rotational key transfers will result in a sharp increase in the amount of traffic undertaken between the key provider and the server. For example, if we want to implement the ResNet-18 network under RNS-CKKS encrypted ImageNet dataset, we need about 600 rotation keys, and the key generator needs to generate about 111 GB of rotation keys for this, which is undoubtedly a huge burden for the client.

To address the above issues, the main contributions of this paper are as follows:

This paper proposes a multi-channel parallel fully homomorphic convolution method for secure and accurate action recognition in cloud computing scenarios. To address the problem of low computational efficiency of fully homomorphic convolution, the ciphertext encapsulation method is changed to encapsulate multi-channel ciphertexts in the same ciphertext, thus achieving parallel computation of multi-channel ciphertexts and effectively improving computational efficiency.

A method to collect the effective values of step convolution is designed to reduce the waste of ciphertext slots. In view of the waste of a large number of ciphertext slots caused by step convolution in CNN networks, the ciphertext encapsulation method is improved to compactly collect sparse output data of multiple channels, reduce the number of ciphertext rotations, and further improve the computational efficiency of convolution.

A multi-layer key generation system is proposed to reduce the communication overhead in the process of rotating key transmission. Considering the huge cost of rotating key in the action recognition process, a multi-layer key generation method is proposed. The server only needs the public key and a small part of the rotating key to generate any desired rotating key, which greatly reduces the computational cost of the key generation stage and the communication cost of the key transmission stage.

The rest of this paper is organized as follows. Section 2 briefly discusses the current work on skeleton data privacy protection and the application of FHE to CNN. Section 3 briefly describes the relevant definitions involved in this paper. Section 4 introduces the system model of this paper and explains the threat model. Section 5 introduces the detailed steps of the proposed original method, including the improved fully homomorphic convolution method and the multi-layer key generation method. Section 6 conducts a theoretical analysis of the security of the proposed scheme. Section 7 introduces the experimental verification of the proposed model under two real datasets and analyzes the relevant experimental results. Finally, Sect. 8 gives relevant conclusions.

Related work

Privacy protection for skeleton action recognition

At present, the three main motion recognition algorithms are based on RGB video, depth map sequence, and skeleton data [10]. Compared with RGB video and depth map sequences, skeleton data contain less personal privacy information, and personal gender, identity, age, and other information cannot be directly obtained from skeleton information, but this does not mean that skeleton data is completely safe. For example, Moon et al. [6] used Shift-GCN [11], MS-G3D [12], and 2 s-AGCN [13] to conduct gender and identity recognition experiments. The results showed that the average accuracy of the gender classifier reached 87%, and the average accuracy of the personal identity re-identification classifier reached 80%. An earlier study [14] also confirmed the possibility of identifying a person from the skeleton extracted from the Kinect dataset.

There are few existing privacy protection methods for skeleton data, and most of them use anonymization methods. Moon et al. [6] proposed an anonymization framework based on adversarial learning to anonymize skeleton datasets while retaining key actions. The anonymizer removes private information from the skeleton while maximizing the accuracy of the action classifier and minimizing the identifiability of private information with other classifiers to avoid potential privacy leakage of the skeleton dataset. In the metaverse scenario, Nair et al. [15] also used the idea of anonymization to train an "anonymizer" model, which anonymizes motion sequences by changing the user information embedding without changing the action information embedding to achieve the purpose of privacy protection. Although the anonymization method using adversarial learning can delete private information to a certain extent, they need to sacrifice the accuracy of action classification and cannot achieve a good balance between privacy protection and action recognition. At the same time, Carr et al. [7] proposed a new redirection-based method to deal with the possible privacy leakage of skeleton data in the metaverse. Classical and deep motion redirection is used to project the target skeleton onto a fake skeleton, thereby making the motion sequence anonymous to protect privacy. Although it can effectively avoid linking attacks, the action recognition accuracy is poor, and it is still necessary to further balance the privacy protection effect with the action recognition accuracy.

Optimization and acceleration of FHE

FHE plays an important role in privacy protection in many real-world applications such as healthcare [16], finance [17], and government [18]. Compared to other privacy protection methods, FHE can not only provide strong privacy protection in cryptography, but also perform calculations on encrypted data, achieving effective utilization of data. However, while providing strong privacy protection, FHE faces huge computational costs. Therefore, the research focus in the field of FHE is gradually shifting toward FHE acceleration. In the past, optimization and acceleration schemes for FHE could be divided into three categories: Number Theoretical Transform (NTT), Bootstrap, and Encoding.

The core idea of NTT algorithm optimization is to simplify the NTT operation by using the existing algorithm and selecting more suitable hardware to accelerate the operation. Agrawal et al. [19] use algorithm optimization to accelerate FHE, replace high-cost operations with low-cost operations, and adopt the Barrett reduction method [20] to improve high-cost modular operations. In addition, the shift and XOR operations in polynomial multiplication also speed up the NTT process. Bootstrap is an important performance bottleneck of FHE because of its high time complexity and space complexity. For the optimization of bootstrap process, most researchers adopt the method of reducing the memory bandwidth and improving the throughput. Chen et al. [21] focused on the bootstrap [22] acceleration of CKKS. They used the dynamic programming method [23] to determine the optimal layer folding strategy of general multilayer linear transformations to achieve the optimal trade-off between layer consumption and operation times, which greatly improved the bootstrap throughput. The coding optimization scheme is to use the coding method to change the data arrangement, so that the data has greater parallel computing capability, so as to accelerate the algorithm implementation.

Among the three acceleration methods mentioned above, relatively few schemes use encoding methods to optimize acceleration. Among the existing encoding optimization methods, Lee et al. [24] were the first to complete the calculation of the ResNet-20 model in the state of RNS-CKKS encryption and verified the realized model with the parameters of the plaintext model in the CIFAR-10 dataset. It reveals the possibility of applying FHE to advanced privacy-preserving machine learning models. Immediately afterward, Kim et al. [25] used ciphertext encapsulation technology to represent multiple nodes of the convolutional layer as the same ciphertext, avoiding the need to mix between different data layouts. Without the complexity of switching back and forth, speeding up security inference and significantly reducing memory usage. Although the above scheme uses ciphertext encapsulation encoding technology to significantly improve the inference speed of the CNN model under homomorphic encryption, there is still a situation of wasting a large number of ciphertext slots when facing strided convolutions. At the same time, they ignore the overhead problem caused by the large number of rotation keys required to apply RNS-CKKS in CNN networks. Therefore, this paper considers further designing the encoding method to improve the compactness of the ciphertext and achieve further optimization and acceleration of FHE. At the same time, considering the overhead of rotating keys, the key generation method is changed to reduce the overhead in the key generation process.

FHE applied to neural networks

The complex network structure of neural networks brings greater computational overhead to FHE applications. In the past, FHE was applied to neural network practices, CryptoNets [26] is an innovative secure computing protocol that implements secure reasoning of neural networks on MNIST datasets for the first time. The core idea of the protocol is to encrypt each node of the neural network into a different ciphertext, and then use these ciphertexts to simulate the normal neural network calculation process without decryption. Faster CryptoNets [27] reduces the computation of ciphertext multiplication by intelligently pruning network parameters, avoiding many computationally complex multiplication operations, and gradually quantifying the remaining network parameters to achieve maximum sparsity in the plaintext encoding. LoLa [28] used vectorization to optimize the representation of data. This method introduces transfer learning technology into the cryptographic neural network reasoning, which can effectively filter and protect the sensitive information in the image. LoLa uses ciphertext packaging technology to encode multiple values from different network nodes into a single ciphertext, thereby improving the processing efficiency of encrypted data without sacrificing security.

Although the application of FHE to neural networks has made some progress, the overall computational efficiency of the model is still not ideal due to the structural complexity of the neural network itself. Therefore, there is still a need to further design encoding methods to improve the computational efficiency of applying FHE to neural networks.

Preliminary knowledge

CNN-based skeleton action recognition

CNN is a feedforward neural network specially designed to process Euclidean structure data. Its core principle is to obtain local features at different positions in matrix data by sliding convolution kernels. CNN is often composed of one or more convolutional layers, corresponding nonlinear activation functions, pooling layers, and fully connected layers at the top.

CNN naturally has good spatial structural feature aggregation capabilities and has achieved excellent results in image classification tasks. Therefore, modeling the spatial structure of the skeleton through CNN has also become one of the research directions for solving skeleton action recognition tasks. The original skeleton coordinates and motion sequences can be directly input into CNN for label prediction.

The convolution layer is the main functional layer of CNN, and its core operation is the convolution operation. The convolution operation is a special linear operation that aggregates the element values in each local area into a single element in the output matrix by weighted summation using a filter (also called a convolution kernel) to obtain a feature map. The convolution kernel first covers an area of the same size, then performs a weighted summation on all element values in the area to obtain a value at the corresponding position on the feature map, then slides to the next position, and continues to perform weighted summation on the next area, repeating the cycle until the last slide is completed.

The activation function is a core concept in CNN. Its main function is to increase the nonlinear expression ability of the network, enable the network to learn more complex feature representations and improve the network's feature learning and classification capabilities.

The pooling layer reduces the size of the feature map by compressing the output feature map of the convolutional layer in blocks along the non-channel dimension. It is a standard subsampling operation. The pooling operation can fuse similar information in the feature map and filter redundant information. Compression of the feature map can also reduce the subsequent number of parameters and calculations, making the network more lightweight.

In conventional convolutional neural networks, the fully connected layer is generally placed at the end of the network to output the final classification (prediction) results.

RNS-CKKS FHE

RNS-CKKS is an FHE scheme for floating point numbers and can perform multiple consecutive arithmetic operations on encrypted data [29]. In the RNS-CKKS scheme, the ciphertext is represented as $(b, a) \in R_{Q}^{2}$ , where $Q$ is a large integer, $R_{Q} = Z_{Q} [X] / < X^{N} + 1 >$ . $N / 2$ slots of a single ciphertext can encrypt $N / 2$ real numbers (or complex numbers), denoting $N / 2$ as $n_{t}$ , and homomorphic operations can perform the same operation on each ciphertext slot at the same time. If the ciphertext of the vector $m \in R^{n_{t}}$ is simply represented as $[m]$ , then homomorphic addition, homomorphic multiplication, and rotation in the RNS-CKKS scheme can be described as:

[m] \oplus [n] = [m + n]

[m] ⊙ n = m ⊙ [n] = [m \cdot n]

R o t ([m] ; r) = [< m >_{r}]

where

\oplus

and

⊙

are addition and multiplication operators in ciphertext,

m \cdot n

represents distributed multiplication and

< m >_{r}

represents a vector

m

that is rotated

r

to the left.

For each homomorphic encrypted ciphertext, there is a non-negative integer $τ$ representing the homomorphic multiplicative capacity. After a homomorphic multiplication, the value of $τ$ is reduced by 1. After $τ$ homomorphic multiplication, the value of $τ$ becomes zero. To support subsequent multiplication operations, the bootstrap operation is performed to change the zero-level ciphertext to a higher-level ciphertext.

For the message vector $m$ whose size is $n$ and less than $n_{t}$ , the message is encrypted by sparse encapsulation. The specific process is to obtain the connection vector $(m | m | \dots | m) \in R^{n_{t}}$ from $m$ , and encrypt the connection vector with the ciphertext full slot.

A Key-Switching Operation (KSO) converts the key of a calculated ciphertext into a smaller key without changing the original information. KSO accounts for most of the computation time in RNS-CKKS schemes [30], so the number of times KSO is executed greatly affects the overall computation time.

Single-input homomorphic convolution

Gazelle [31] proposed a method of Single-Input Single-Output (SISO) convolution based on FHE data. In SISO, different ciphertexts containing input tensors are rotated with different values. The input ciphertext after rotation is multiplied with the plaintext vector of the corresponding filtering coefficient, and the output result is added after the multiplication results are added. Use $(p, q)$ to represent the SISO convolution result of the $q$ -th input and $p$ -th output. For example, for $c_{i}$ input channels, the SISO convolution result for the $p$ -output channel is $(p, 1) + (p, 2) + \dots + (p, c_{i})$ .

First, use a row-major mapping to transform the matrix in $R^{d_{1} \times d_{2}}$ into a vector of dimension $n = d_{1} d_{2}$ . For the matrix $A = (a_{ij}) \in R^{d_{1} \times d_{2}}$ , define a mapping $R^{d_{1} \times d_{2}} \to R^{n}$ to transform the matrix $A$ into a vector $vec (A) = (a_{11}, a_{12}, \dots, a_{1 d_{2}}, \dots, a_{d_{1} 1}, a_{d_{1} 2}, \dots, a_{d_{1} d_{2}})$ . For a three-dimensional tensor $R^{d_{1} \times d_{2} \times d_{3}}$ , it can be transformed into a vector $vec (A) = (vec (A_{1}) | vec (A_{2}) | \dots | vec (A_{d_{1}})$ of $R^{d_{1} \cdot d_{2} \cdot d_{3}}$ , where $A_{κ} \in R^{d_{2} \times d_{3}}$ is a matrix.

The tensor $A \in R^{h_{i} \times w_{i} \times c_{i}}$ is the input to the convolution, and $A' \in R^{h_{o} \times w_{o} \times c_{o}}$ is the output of the convolution, where $h$ and $w$ represent the height and width of the tensor, respectively, and $c$ is the number of input and output channels. For simplicity, this paper assumes that the horizontal and vertical strides of the convolution process are the same, and uses $s$ to represent the strides of the convolution.

Simple convolution for a single input $X$ in $R^{h_{i} \times w_{i}}$ through a single filter $(f_{h} \times f_{w})$ . The final output form of the convolutional layer is:

z_{i, j} = \sum_{|u| \leq α, |v| \leq β} x_{i \cdot s + u, j \cdot s + v} \cdot f_{u, v}

where

x_{i, j}

and

z_{i, j}

are the input data and output data of neurons in row

i

and column

j

, respectively,

α = ⌊f_{h} / 2⌋

β = ⌊f_{w} / 2⌋

, and

f_{u, v}

is the weight of column

v

in row

u

of the filter. In the process of convolution calculation, since the vector containing the input data is already in the ciphertext state, the ciphertext position can only be rearranged by rotation. The input is expressed as:

x_{i \cdot s + u, j \cdot s + ν} = v e c (X) [(i \cdot s + u) \cdot w + j \cdot s + ν] = ρ^{μ \cdot w + ν} (v e c (X)) [i \cdot s \cdot w + j \cdot s]

where

ρ^{r}

represents a rotation to the left by

r

positions, so Eq. (4) can be further expressed as:

\begin{matrix} z_{i, j} = & \sum_{| u | \leq α, | v | \leq β} (ρ^{u \cdot w + ν} (v e c (X)) [i \cdot s \cdot w + j \cdot s]) \cdot f_{u, ν} \\ = (\sum_{| u | \leq α, | v | \leq β} ρ^{u \cdot w + ν} (v e c (X)) \cdot f_{u, ν}) [i \cdot s \cdot w + j \cdot s] \end{matrix}

Therefore, a simple convolution on the ciphertext $c t_{X}$ of input $X$ can be expressed as:

H E - C o n v (c t_{X}, {p t . F_{u, ν}}) = \sum_{| u | \leq ⌊f_{h} / 2⌋} \sum_{| v | \leq ⌊f_{w} / 2⌋} M u l t P l a i n (ρ^{u \cdot w + ν} (c t_{X}), p t . F_{u, ν})

where

MultPlain(ct,pt)

represents the multiplication of plaintext

pt

and ciphertext

ct

pt {.F}_{u, ν}

is a plaintext polynomial with filter weights in place. Since the vector containing the filter coefficients is a plaintext vector, the weight coefficient of the filter can be placed anywhere in the plaintext vector.

System overview

System model

This paper adopts the method of FHE to encrypt the skeleton data and upload it to the cloud server. The cloud server stores the trained CNN model, performs action recognition with the skeleton data encrypted, generates encrypted action recognition results and sends them to the local monitoring service provider. The data of the entire process are in a strictly encrypted state, and the cloud server cannot see any information of the monitored object, avoiding the possibility of privacy information leakage. The entire system architecture is shown in Fig. 1.

[See PDF for image]

Fig. 1

System architecture

The whole system consists of four parts, namely the monitored object, cloud server provider, local monitoring service provider, and monitoring service user. The monitored object generates video data and extracts skeleton information from it. The cloud server is responsible for motion recognition of the uploaded encrypted information. The local monitoring service provider is responsible for generating public and private keys and rotating keys, and decrypting the encrypted motion information sent by the cloud server. Once a dangerous behavior occurs, it will send information to the monitoring service user in a timely manner.

The specific workflow is as follows:

Key generation

Generate the public key $pk$ , private key $sk$ , and rotation key $evk$ required for FHE at the local monitoring service provider.

Key distribution

The local monitoring service provider distributes the public key $pk$ to the monitored object, the rotating key $evk$ to the cloud server, and the private key $sk$ is retained.

Skeleton data encryption

After obtaining the surveillance video, the monitored object side extracts the skeleton data and encrypts it using the public key $pk$ . The encrypted skeleton data is then uploaded to the cloud server.

Homomorphic convolution

The encrypted skeleton data is transmitted to the cloud server and then input into the CNN network for action recognition. The action recognition results in the encrypted state are obtained and sent to the local monitoring service provider.

Decryption

The local monitoring service provider uses its own private key $sk$ to decrypt the identified action results.

Warning information sent

When the local monitoring service provider detects dangerous action information, it will send warning information to the monitoring service user in a timely manner.

Threat model

The system design in this paper is based on the premise that authenticated secure communication channels must be established between the monitored object and the cloud server, as well as between the cloud server and the local monitoring service provider, to ensure the security of the data and prevent unauthorized modification or identity impersonation. In addition, the local monitoring service providers should be independent of cloud servers and avoid any form of collusion. If they share data, the cloud can access the decrypted skeleton data, causing the monitored object information to leak. This article considers the following threat model.

All parties involved in the system are semi-honest (honest and curious), that is, they have some level of security, but the monitored object cannot fully trust them to handle sensitive data or perform critical tasks. The monitored objects do not want their skeleton information to be obtained by any participant to avoid privacy information disclosure.
A passive adversary constructs a ciphertext corresponding to a decrypted result by accessing the public standard interface provided by FHE and requires a decryption oracle (which has the ability to check whether the ciphertext is indeed legally generated) to respond with the decryption result, which may help the passive adversary distinguish the ciphertext or even recover the private key [32].

Action recognition under FHE

Ordinary action recognition model under FHE

This paper designs a simple action recognition CNN model, which contains a three-layer network structure. In each layer of the network structure, there is a convolution layer, followed by a batch normalization layer, an activation layer, and a maximum pooling layer. The step sizes of the three convolution layers are set to 1, 2, and 2, respectively. The network ends with an average pooling layer, a fully connected layer, and a softmax. The ReLU activation is replaced by a quadratic polynomial approximation, and the coefficients are adjusted during the training phase.

The convolution calculation process under FHE is shown in Fig. 2a and b is the matrix representation of the fully homomorphic convolution process.

[See PDF for image]

Fig. 2

Vector and matrix representation of the fully homomorphic convolution process

The input is encrypted to a single ciphertext as described above, that is, a mapping function converts the matrix to a one-dimensional vector, and then encrypts the resulting one-dimensional vector. The process of convolution can be described by sliding one function (i.e., convolution kernel or filter) along another function (input signal) and calculating the sum of the product of the two at each position. In the FHE state, a Single Instruction Multiple Data (SIMD) can be used to compute convolution results for all positions. This process can be realized by rotating the encrypted data. For the ciphertext with different rotation positions, multiply the plaintext polynomial of the filter weight corresponding to it to get different ciphertext results. Finally, all the obtained ciphertext are added together, which is the result of convolution calculation.

Multi-channel convolution is represented as the result of convolution of the input tensor $X \in R^{h_{i} \times w_{i} \times c_{i}}$ with $c_{i}$ filter groups ${F_{j} \in R^{c_{i} \times f_{h_{i}} \times f_{w_{i}}}}$ . The vector satisfies the following homomorphic properties:

D e c (ρ^{h_{i} \cdot w_{i} \cdot κ} (c t)) \approx ρ^{h_{i} \cdot w_{i} \cdot κ} (v e c (X)) = (v e c (X_{κ}) | v e c (X_{κ + 1}) | \dots | v e c (X_{κ + c_{i} - 1}))

For $0 \leq κ < c_{i}$ , starting with the first filter bank $F_{0} = {F_{0 κ} \in R^{f_{h_{i}} \times f_{w_{i}}}}_{κ = 0}^{c_{i} - 1}$ , for the first $h_{i} \cdot w_{i}$ plaintext slot, the convolution process consists of two steps: (1) Outer Rotation: the ciphertext is rotated by a multiple of $h_{i} \cdot w_{i}$ ; (2) Inner Rotation: at each rotation position $κ$ , a single channel convolution is performed, i.e., the convolution of $X_{κ}$ with the kernel $F_{0 κ}$ . This process is repeated for all rotation positions, and the results are summed to produce a single output channel.

The convolution layer parameters $c_{in}$ and $c_{out}$ represent the number of input and output channels, respectively. $c_{in}'$ and $c_{out}'$ indicate the number of input and output channels after ciphertext packaging. Therefore, the number of input ciphertexts is $n_{in} = ⌈c_{in} / c_{in}'⌉$ and the number of output ciphertexts is $n_{out} = ⌈c_{out} / c_{out}'⌉$ . Assume that a ciphertext $c t_{i}$ represents the ciphertext input of the $i$ -th input channel. For $j = 1, 2 . . ., n_{out}$ , the multi-channel convolution result for the $j$ -th output channel is expressed as follows:

\begin{matrix} H E - C o n v_{j} (c t_{1}, \dots, c t_{n_{in}}, (p t . F_{i, j, l, κ})) = & \sum_{1 \leq i \leq n_{in}} \sum_{0 \leq κ < c_{in}'} H E - C o n v (ρ^{d i s t . κ} (c t_{i}), {p t . F_{i, j, l, κ}}) \\ = \sum_{1 \leq i \leq n_{in}} \sum_{0 \leq κ < c_{in}'} \sum_{| l | \leq f} M u l t P l a i n (ρ^{r_{l} + d i s t . κ} (c t_{i}), p t . F_{i, j, l, κ}) \end{matrix}

where

f = (1 + 2 \cdot α) \cdot (1 + 2 \cdot β)

p t . F_{i, j, l, κ}

is the plaintext polynomial with filter weights in place,

dist

represents the distance between two adjacent channels on the plaintext slot, such as

h_{i} \cdot w_{i}

in (8), and

r_{l}

represents the rotation of the ordinary convolution, such as

u \cdot w + v

in (7).

Improved fully homomorphic convolution

Since the most time-consuming part of implementing a standard CNN on the RNS-CKKS scheme is the KSO operation, it is desirable to reduce the KSO runtime. In order to reduce the number of KSOs in CNN reasoning, the number of rotations should be reduced as much as possible, since rotations require KSOs. SIMD can process addition and multiplication operations in parallel during convolution calculations, which greatly improves the efficiency of homomorphic computing. However, the SIMD method cannot directly reduce the number of multiplication operations in the multi-channel convolution calculation process, and therefore cannot reduce the number of KSO operations. To reduce the total runtime, the intermediate data should be packed into ciphertext as compactly as possible during the inference phase.

The use of multiplexed data encapsulation can effectively reduce the number of rotations and improve computational efficiency. In this padding method, the plaintext tensor of size $h_{i} \times w_{i}$ in the $z_{i}^{2}$ channel is first mapped to a larger multiplexed tensor of size $z_{i} h_{i} \times z_{i} w_{i}$ . Then, multiple multiplexed tensors are encrypted in one ciphertext. For $t_{i} = ⌈\frac{c_{i}}{z_{i}^{2}}⌉$ , multiplexed encapsulation is a function that maps tensor $A = {(A_{i_{0}, i_{1}, i_{2}})}_{0 \leq i_{0} < h_{i}, 0 \leq i_{1} < w_{i}, 0 \leq i_{2} < c_{i}} \in R^{h_{i} \times w_{i} \times c_{i}}$ to ciphertext $Enc (Vec (A')) \in R^{n_{t}}$ , where $A' = {(A_{i_{3}, i_{4}, i_{5}}^{'})}_{0 \leq i_{3} < z_{i} h_{i}, 0 \leq i_{4} < z_{i} w_{i}, 0 \leq i_{5} < t_{i}} \in R^{z_{i} h_{i} \times z_{i} w_{i} \times t_{i}}$ is a multiplexed tensor.

A_{i_{3}, i_{4}, i_{5}}^{'} = \{\begin{matrix} A_{⌊i_{3} / z_{i}⌋, ⌊i_{4} / z_{i}⌋, γ}, & i f γ < c_{i} \\ 0, & otherwise \end{matrix})

where

\begin{matrix} 0 \leq i_{3} < z_{i} h_{i} \end{matrix}

0 \leq i_{4} < z_{i} w_{i}

0 \leq i_{5} < t_{i}

γ = z_{i}^{2} i_{5} + z_{i} (i_{3} mod z_{i}) + i_{4} mod z_{i}

. Figure 3 describes the multi-channel packaging of

h_{i} = w_{i} = 4

and

z_{i} = 2

[See PDF for image]

Fig. 3

Multi-channel data encapsulation method

On the basis of multi-way data packing, not only the convolution with stride one (i.e., $s = 1$ ) should be considered, but also the case where the stride is greater than 1 in the actual convolutional neural network architecture. The strided convolution setting can capture a wider range of contextual information without increasing the parameters and computational complexity, and improve the perception ability of the model. However, the process of striding convolution will generate a large number of blank ciphertext slots. After each striding convolution, the gap between valid data increases by $s$ times, resulting in a reduction in filling density by $s^{2}$ times, as shown in Fig. 4. For CNNS with a large number of step operations, the computation time of the entire network increases dramatically with the number of step operations. In order to effectively reduce the running time of step convolution, it is necessary to solve the problem of low filling density and improve the utilization rate of ciphertext slots. Therefore, this paper attempts to reduce the running time by compactly packing these sparse data in a way that extracts valid values.

[See PDF for image]

Fig. 4

Strided convolution process

This paper considers the case of strided convolution (i.e., $s \geq 2$ ), and selects and collects the effective values of the output gap $k_{o} = s k_{i}$ . The convolution filter is $U \in R^{f_{h} \times f_{w} \times c_{i} \times c_{o}}$ . First, define the $MultWgt (U ; i_{0}, i_{1}, i)$ function to map the weight tensor $U$ to the elements in $R^{n_{t}}$ . Then define a three-dimensional multiple shift weight vector ${\bar{U}}^{' (i_{0}, i_{1}, i)} = {({\bar{U}}_{i_{3}, i_{4}, i_{5}}^{' (i_{0}, i_{1}, i)})}_{0 \leq i_{3} < z_{i} h_{i}, 0 \leq i_{4} < z_{i} w_{i}, 0 \leq i_{5} < t_{i}} \in R^{z_{i} h_{i} \times z_{i} w_{i} \times t_{i}}$ , where $0 \leq i_{0} < f_{h}$ , $0 \leq i_{1} < f_{w}$ , $0 \leq i < c_{o}$ , are defined as follows:

{\bar{U}}_{i_{3}, i_{4}, i_{5}}^{' (i_{0}, i_{1}, i)} = \{\begin{matrix} 0, & \begin{matrix} i f z_{i}^{2} i_{5} + z_{i} (i_{3} mod z_{i}) + i_{4} mod z_{i} \geq c_{i} \\ o r ⌊i_{3} / z_{i}⌋ - (f_{h} - 1) / 2 + i_{0} \notin [0, h_{i} - 1] \\ o r ⌊i_{4} / z_{i}⌋ - (f_{w} - 1) / 2 + i_{1} \notin [0, w_{i} - 1] \end{matrix} \\ U_{i_{0}, i_{1}, z_{i}^{2} i_{5} + z_{i} (i_{3} mod z_{i}) + i_{4} mod z_{i}, i}, & otherwise \end{matrix})

Among them, $0 \leq i_{3} < z_{i} h_{i}$ , $0 \leq i_{4} < z_{i} w_{i}$ , $0 \leq i_{5} < t_{i}$ . $MultWgt$ function is defined as $MultWgt (U ; i_{0}, i_{1}, i) = Vec ({\bar{U}}^{' (i_{0}, i_{1}, i)})$ .

In addition to the weight tensor, you also need to define a multi-way selection tensor $S^{' (i)} = {(S_{i_{3}, i_{4}, i_{5}}^{' (i)})}_{0 \leq i_{3} < z_{o} h_{o}, 0 \leq i_{4} < z_{o} w_{o}, 0 \leq i_{5} < t_{o}} \in R^{z_{o} h_{o} \times z_{o} w_{o} \times t_{o}}$ to select the valid value after convolution, where $t_{o} = ⌊\frac{c_{o}}{z_{o}^{2}}⌋$ is defined as follows:

S_{i_{3}, i_{4}, i_{5}}' = \{\begin{matrix} 1, & i f z_{0}^{2} i_{5} + z_{0} (i_{3} mod z_{0}) + i_{4} mod z_{0} = i \\ 0, & otherwise \end{matrix})

Among them, $0 \leq i_{3} < z_{o} h_{o}$ , $0 \leq i_{4} < z_{o} w_{o}$ , $0 \leq i_{5} < t_{o}$ . Figure 5 describes the multi-way convolution process at $s = 2$ .

[See PDF for image]

Fig. 5

Schematic diagram of multi-way parallel strided convolution

In Fig. 5, the four channels of ciphertext data are integrated into a larger tensor structure, which is achieved through the multiplexing technique. Similarly, the filter coefficients of these four channels are encapsulated in a larger matrix. Next, the SISO convolution operation is performed on the four channels simultaneously. After the convolution process, the convolution results of these four channels are combined by means of rotation and homomorphic addition. In the end, only the valid portion of the merged data is preserved.

Multi-layer key generation

The rotation operation $(v_{i}) \to (v_{i + r})$ of the CKKS scheme is an operation map in the encrypted state, which can be unified to $m (X) \to m (X^{5^{r}})$ for ring elements. For ciphertext $m (X)$ in key $s (X)$ , the processed ciphertext first meets $b (X^{5^{r}}) + a (X^{5^{r}}) \cdot s (X^{5^{r}}) \approx m (X^{5^{r}})$ for key $(b (X), a (X)) \to (b (X^{5^{r}}), a (X^{5^{r}}))$ , that is, ciphertext $m (X^{5^{r}})$ with key $s (X^{5^{r}})$ . If you want to convert this ciphertext to the same plaintext ciphertext with the original key $s (X)$ , $m (X^{5^{r}})$ can be changed to $m (X)$ by using the rotation key of loop shift $r$ .

This paper designs a "master rotating key" by placing a multi-layer structure on the rotating keys, which allows the server to generate rotating keys using public keys, and the rotating keys of higher levels can be used to generate rotating keys of lower levels. The "master rotating key" corresponds to the highest level rotating key, and the rotating key used for calculation is the rotating key of the lowest level (level 0), that is, the "original key". The stratification of these tiers is determined by the magnitude of the overall modulus associated with each key that undergoes rotation, as well as the unique modulus values that are employed. Figure 6 compares the differences between the conventional rotating key generation method and the multi-layer rotating key generation method.

[See PDF for image]

Fig. 6

Comparison between the conventional rotating key generation method and the multi-layer key generation method a Conventional rotation key generation method b Hierarchical rotation key generation method

Figure 6a illustrates the case where the conventional method is used to generate rotating keys. The local monitoring service provider needs to generate and convey all the necessary keys for the rotational operations, which is a heavy burden on the local monitoring service provider. Figure 6b generates a rotation key for a multi-layer key generation system. No matter how many rotation operations are required by the server, the local monitoring service provider only needs to send a small number of advanced rotation keys and public keys.

In the $λ$ -level multi-layer key generation system, there are $λ$ groups of rotating keys, their levels from key level $λ - 1$ to 0. A higher-level rotating key can be used to generate additional lower level keys. The main algorithms for multi-layer key generation are TopKeyGen and SimKeyGen. TopKeyGen generates the highest level of rotating keys by the local monitoring service provider. SimKeyGen generates common level of rotating keys using a rotating key set of public keys and high-level keys. The definitions of the two algorithms are as follows.

$TopKeyGen (s, T_{λ - 1}) \to {g k_{i}^{(λ - 1)}}_{i \in T_{λ - 1}}$ : Given keys $s$ and cyclic shift $T_{λ - 1}$ , generate the highest level rotated key ${g k_{i}^{(λ - 1)}}_{i \in T_{λ - 1}}$ under the cyclic shift $T_{λ - 1}$ .

$SimKeyGen (ℓ, U_{ℓ}, {g k_{i}^{(ℓ_{i})}}_{i \in U_{ℓ}}, p k, T_{ℓ}) \to {g k_{i}^{(ℓ)}}_{i \in T_{ℓ}}$ : Under the condition of known public key $pk$ and a rotating key ${g k_{i}^{(ℓ_{i})}}_{i \in U_{ℓ}}$ of level $ℓ_{i}$ , the cyclic shift set for generating the rotating key is $U_{ℓ}$ , where $ℓ_{i} > ℓ$ . The cyclic shift set provided by the server is $T_{ℓ}$ , and a rotating key ${g k_{i}^{(ℓ)}}_{i \in T_{ℓ}}$ with key level $ℓ$ and cyclic shift $T_{ℓ}$ is generated in key management server.

The key $g k_{i}^{(ℓ)}$ represents the rotation key of level $ℓ$ with a cyclic shift of $i$ . Each key has a set of rotational shifts, which is composed of a series of whole numbers, denoted by $T_{0}, \dots, T_{λ - 1}$ , and these sets are disjointed pairs. For a cyclic shift set greater than class $ℓ$ , it is represented by $U_{ℓ}$ . If all the required rotation keys in bond levels greater than $ℓ$ have been generated, then $⋃_{i = ℓ + 1}^{λ - 1} T_{i}$ . The conventional system of rotating keys may be regarded as a special instance within the context of the newly introduced multi-layer key rotation framework.

The specific process is shown in Algorithm 1.

[See PDF for image]

Algorithm 1

Multi-layer key generation algorithm with level $λ$ keys.

The multi-layer rotating key generation scheme proposed in this paper uses public key $(b, a)$ and the highest rotating key generated by the local monitoring service provider to generate a rotating key with a key level less than $λ - 1$ . The set of cyclic shifts of the rotating keys with key levels greater than $ℓ$ is called $T_{ℓ}$ that allow repetition and generate all rotating keys with key level $ℓ$ . Because $T_{λ - 1}$ is small, the computational workload and communication cost can be reduced.

If $b + a \cdot s = m + e mod q$ is satisfied, then $(b, a) \in R_{q}^{2}$ can be defined as a valid ciphertext of ciphertext $m$ , where $e$ is a polynomial whose coefficients are less than those of $q$ . For cyclic shift $r$ , if every $g k_{r, i} = (b_{r, i}, a_{r, i}) \in R_{PQ}^{2}$ is valid for $P \cdot {\hat{Q}}_{i} \cdot {[{\hat{Q}}_{i}^{- 1}]}_{Q_{i}} \cdot s (X^{5^{r}})$ , then a key exchange procedure can be executed on the encrypted data within the modulus $q$ framework, where $q$ is a divisor of $Q$ , $Q$ is the evaluation modulus, $P$ is the special modulus, ${\hat{Q}}_{i} = \prod_{j \neq i} Q_{j}$ .

In the TopKeyGen operation scheme, the advanced key generation method for a single rotation cyclic shift $r$ and key polynomial $s \in R$ is $g k_{r}^{(λ - 1)} = {g k_{r, i}^{(λ - 1)}}_{i = 0, \dots, {hdnum}_{λ - 1} - 1}$ , where $g k_{r, i}^{(λ - 1)} = (b_{r, i}^{(λ - 1)}, a_{r, i}^{(λ - 1)}) \in R_{Q_{λ - 1} P_{λ - 1}}^{2}$ , $a_{r, i}^{(λ - 1)} \leftarrow R_{Q_{λ - 1} P_{λ - 1}}$ , $b_{r, i}^{(λ - 1)} = - a_{r, i}^{(λ - 1)} \cdot s + e_{r, i}^{(λ - 1)} + P_{λ - 1} \cdot {\hat{Q}}_{λ - 1, i} \cdot {[{\hat{Q}}_{λ - 1, i}^{- 1}]}_{Q_{λ - 1, i}} \cdot s (X^{5^{r}})$ , where $e_{r, i}^{(λ - 1)} \leftarrow χ$ .

The SimKeyGen operation consists of two basic algorithms, TopToSim and SimToSim. The function of the TopToSim algorithm is to generate low level rotation key from high level rotation key in $r - s h i f t$ . The function of the SimToSim algorithm is to generate $r + r^{'} - s h i f t$ rotation keys from $ℓ$ class $r^{'} - s h i f t$ rotation keys. If the local monitoring service provider generates and transmits the highest-level rotating key set corresponding to the cyclic shift set $S = {r_{0}, r_{1}, \dots, r_{t - 1}}$ to the server, the server can generate some low-level $\sum_{i = 0}^{t - 1} w_{i} r_{i} - s h i f t$ -rotating keys by sequentially calculating the TopToSim operation and the SimToSim operation.

For changing the cyclic shift SimToSim operation, define $s h i f t - r$ as the rotating key under cyclic shift $r$ , $(r, ℓ)$ as the rotating key under cyclic shift $r$ with key level $ℓ$ . The purpose of SimToSim is to generate a $(r + r^{'}, ℓ)$ rotation key from a $(r, ℓ)$ rotation key. $g k_{r, i}^{(ℓ)}$ is a valid rotation key corresponding to ciphertext $P_{ℓ} \cdot {\hat{Q}}_{ℓ, i} \cdot {[{\hat{Q}}_{ℓ, i}^{- 1}]}_{Q_{ℓ, i}} \cdot s (X^{5^{r}})$ , and if $g k_{r, i}^{(ℓ)}$ is rotated by a cyclic shift $r'$ , the ciphertext will become:

P_{ℓ} \cdot {\hat{Q}}_{ℓ, i} \cdot {[{\hat{Q}}_{ℓ, i}^{- 1}]}_{Q_{ℓ, i}} \cdot s ({(X^{5^{r^{'}}})}^{5^{r}}) = P_{ℓ} \cdot {\hat{Q}}_{ℓ, i} \cdot {[{\hat{Q}}_{ℓ, i}^{- 1}]}_{Q_{ℓ, i}} \cdot s (X^{5^{r + r^{'}}})

For TopToSim operation, the server receives the highest-level spin key from the local monitoring service provider and needs to generate the lower-level spin key from the highest-level spin key first. On the basis of having a low-level rotation key, we can generate a $s h i f t - 0$ rotation key first, and SimToSim operation on the basis of $s h i f t - 0$ rotation can generate a $s h i f t - r'$ rotation key. $s h i f t - 0$ The definition of a rotation bond is $g k_{0}^{(λ - 1)} = {g k_{0, i}^{(λ - 1)}}_{i = 0, \dots, h d n u m_{λ - 1} - 1}$ , where $g k_{0, i}^{(λ - 1)} = (b_{0, i}^{(λ - 1)}, a_{0, i}^{(λ - 1)}) \in R_{Q_{λ - 1} P_{λ - 1}}^{2}$ , $b_{0, i}^{(λ - 1)} = - a_{0, i}^{(λ - 1)} \cdot s + e_{0, i}^{(λ - 1)} + P_{λ - 1} \cdot {\hat{Q}}_{λ - 1, i} \cdot {[{\hat{Q}}_{λ - 1, i}^{- 1}]}_{Q_{λ - 1, i}} \cdot s$ .

The process of generating ${gk}_{0, i}^{(ℓ)}$ from public key $(b, a) \in R_{Q_{k - 1}}^{2}$ , extracting the value of the corresponding RNS module, and reducing the public key to $(b^{'}, a^{'}) = (b mod Q_{ℓ} P_{ℓ}, a mod Q_{ℓ} P_{ℓ}) \in R_{Q_{ℓ} P_{ℓ}}^{2}$ . Then, set $b_{0, i}^{(ℓ)} = b^{'}$ , $a_{0, i}^{(ℓ)} = a^{'} + P_{ℓ} \cdot {\hat{Q}}_{ℓ, i} \cdot {[{\hat{Q}}_{ℓ, i}^{- 1}]}_{Q_{ℓ, i}}$ . Then, we can get $b_{0, i}^{(ℓ)} = - a_{0, i}^{(ℓ)} \cdot s + e_{0, i}^{(ℓ)} + P_{ℓ} \cdot {\hat{Q}}_{ℓ, i} \cdot {[{\hat{Q}}_{ℓ, i}^{- 1}]}_{Q_{ℓ, i}} \cdot s$ . If $(b_{0, i}^{(ℓ)}, a_{0, i}^{(ℓ)})$ is defined as $g k_{0, i}^{(ℓ)}$ , then the set $g k_{0}^{(ℓ)} = {g k_{0, i}^{(ℓ)}}_{i = 0, \dots, h d n u m_{ℓ} - 1}$ is a valid $(0, ℓ)$ -rotation key. Finally, a SimToSim operation is performed to generate a $s h i f t - r$ rotation key.

Security analysis

Data security

Assume that all parties are honest and curious in the security model, that is, all parties follow all steps and do not know any additional information except the information they have accessed. Under this model, the communication between the client and the server is secure and will not be attacked by external adversaries. During the entire transmission process, even if the channel is monitored or intercepted, the attacker will only obtain the feature template or feature value after the monitor object's encryption. Without the monitor object's key, the original data cannot be obtained, and the monitor object's key will not be disclosed to the outside, thus ensuring the security of the encrypted data.

Theorem 1: The homomorphic encryption scheme based on ring learning and error assumptions is indistinguishable under chosen plaintext attack (IND-CPA) [33].

IND-CPA security is defined as follows: for any probabilistic polynomial-time attacker $A$ , the following advantage function is trivial:

A d v_{A, E n c} (ν) = | Pr [A (E n c_{ν} (M_{0})) = 1] - Pr [A (E n c_{ν} (M_{1})) = 1] |

Among them, $M_{0}$ and $M_{1}$ are the two plaintexts selected by the challenger, ${Enc}_{κ}$ is the encryption algorithm, and $ν$ is the security parameter. Suppose an attacker $A$ can distinguish ${Enc}_{ν} (M_{0})$ and ${Enc}_{ν} (M_{1})$ in polynomial time, build an algorithm $B$ , receive an instance $(a, b)$ , choose $M_{0}$ , $M_{1}$ and calculate $c = Enc (M_{b}) + a mod q$ , give $c$ to $A$ , $A$ get guess $b'$ , $B$ checks whether $b' = b$ . Since the ring learning error problem is intractable, there is no effective $A$ that can significantly distinguish ${Enc}_{ν} (M_{0})$ and ${Enc}_{ν} (M_{1})$ . Therefore, the homomorphic encryption scheme based on the ring learning error is safe under the IND-CPA attack.

All calculations on the server are performed in an encrypted state. Therefore, due to the IND-CPA security of homomorphic encryption itself, the server cannot understand the monitor object's information and can protect the monitor object's own privacy information.

Key security

For the CKKS scheme itself, even if the attacker uses a key recovery attack or forgery attack, assuming that the key of CKKS in the system is set to 128 bits, the key space size is $2^{128}$ , and the key size of CKKS can be set to more than 128, thereby obtaining higher security. Therefore, the CKKS encryption system provides a higher security level. Consider that CKKS schemes are still at risk of being cracked by linear key recovery attacks. Let the security parameter modulus be $q$ , the ring size be $n$ , and $k = O (n)$ , generate the private key $sk$ , the public key $pk$ , the rotation key $evk$ , and encrypt the ciphertext $c t_{i} (1 \leq i \leq k)$ . Use the rotation key $evk$ for homomorphic operation, and use the decryption and decoding to obtain the plaintext:

m_{i} = D e c o d e (D e c_{sk} (c t_{i}), Δ)

where

Δ

is the scaling factor. If

Encode (m_{i}, Δ) = {Dec}_{sk} (c t_{1})

is satisfied for all

m_{i} (1 \leq i \leq k)

, the private key

sk

can be recovered with a high probability. If

q

is a prime number or the product of different prime numbers, as long as

k \geq 1

, the above content is satisfied. However, if the decrypted plaintext result is only disclosed to the key owner, the CKKS scheme is secure and will not be attacked by key recovery. Given the decryption outcomes from the local monitoring service provider are not disclosed to any third parties, the protocol remains impervious to key recovery assaults.

Performance evaluation

In this section, to verify the performance of the proposed method, simulations are performed on two real public datasets, HMDB51 [34] and UR Fall Detection Dataset (URFD) [35]. This paper uses the SEAL library [36] released by Microsoft to implement the proposed model, which is equipped with a bootstrapped implementation of RNS-CKKS. The simulation environment of this experiment is the Ubuntu 22.04.3 LTS 64-bit operating system, the processor is Intel i3-10,100 CPU @ 3.60 GHz, and 16.0 GB memory. The neural network framework is built using PyTorch 1.13.1, and the experimental process is implemented in Python and C + + .

Dataset

HMDB51 is a motion recognition dataset of 6849 video clips covering 51 categories of daily movements, with at least 101 samples per category, mostly from movies, but also from public databases such as YouTube and Google Video.

URFD is a human action detection dataset containing 70 videos, including 30 falling actions and 40 daily actions. The data contains three modalities: RGB camera, depth camera, and sensor data.

The samples of the above two datasets were combined, and the combined dataset was divided into the training set and the test set according to the ratio of 7:3. For the video clips in the dataset, the skeleton joint point data are arranged into a tensor of size 2 × 15 × 32 by concatenating the 15 joint positions detected by HRNet [37] in the generated 32-frame clips.

Simulation results

Performance comparison with other models

In order to verify the superiority of the model proposed in this paper, action recognition performance tests are carried out on different models, and the average running time and throughput are shown in Table 1.

Table 1. Comparison of average running time and throughput of action recognition under different models

Model	Running time(s)	Throughput(samples/s)
[26]	289.936	0.003
[27]	67.395	0.015
[28]	12.392	0.081
[24]	6.855	0.146
Ours	2.152	0.361

It can be seen that the method proposed in this paper has made great progress in both running time and throughput. Compared with the sub-optimal model, the running time and throughput are increased by 0.69 times and 1.47 times respectively. The results show that the proposed model can effectively improve the efficiency and throughput of action recognition under full homomorphic encryption, and recognize user actions in time on the premise of ensuring user data security. In the baseline model, CryptoNets, as the first protocol to implement secure neural network inference on MNIST data, has poor performance in model runtime and throughput. Faster CryptoNets proposes a pruning and quantization method that uses sparse representations in the underlying cryptosystem to speed up inference, which is an order of magnitude faster than CryptoNets, but still not ideal; LoLa also uses the method of ciphertext processing, through ciphertext packaging, multiple values from network nodes are represented as the same ciphertext, which can optimize the delay to a certain extent. LEE et al. 's scheme is inferior to this scheme in terms of running time and throughput because the ciphertext arrangement is not tight enough in the calculation process, resulting in a large number of blank ciphertext slots.

By comparing the memory usage of different schemes at different stages, the results shown in Fig. 7 are obtained.

[See PDF for image]

Fig. 7

Comparison of memory usage at different stages of different models

As can be seen from Fig. 7, the scheme in this paper significantly reduces memory usage due to the extensive use of ciphertext encapsulation and SIMD calculations. In the LoLa scheme, the operation is simplified by different representations of convolution layers. However, these simplified processes lead to a large number of homomorphic encryption operations when dealing with high-dimensional inputs, which makes the LoLa model not ideal in terms of memory consumption; other methods, such as CryptoNets and Faster CryptoNets, use batch axis packaging methods for optimization, which can be optimized to varying degrees to reduce memory usage; although LEE et al.'s scheme also uses a large number of SIMD calculation methods, since there is no ciphertext encapsulation technology similar to this scheme, the memory usage is significantly higher than that of this scheme.

Comparison of classification performance

In order to test the classification performance of the model, the action "fall" is taken as an example, and typical performance indicators such as classification accuracy, sensitivity, specificity, precision, and F1 score are used to compare with the classification performance of baseline models. The results are shown in Table 2.

Table 2. Action classification performance indicators

	accuracy	sensitivity	specificity	precision	F1-score
Raw Data	0.8933	0.8896	0.9134	0.8967	0.8931
[6]	0.8064	0.7932	0.8263	0.8092	0.8011
[7]	0.6275	0.6285	0.6328	0.6293	0.6289
Ours	0.8814	0.8889	0.9047	0.8916	0.8902

The results show that compared with the privacy protection method of skeleton anonymization, the method proposed in this paper can better protect the integrity of skeleton data and thus ensure the accuracy of action recognition. The results show that although there are differences between the performance indicators of encrypted and unencrypted data due to slight homomorphic errors, the overall performance difference is between 0.0007 and 0.012, which does not significantly reduce the accuracy of the action recognition model. These results show that the action recognition method under FHE proposed in this paper can achieve privacy protection of skeleton data with slight loss.

In order to better verify the classification effect of the action recognition model under FHE, the action recognition confusion matrix comparing the original data and the encrypted data is shown in Fig. 8.

[See PDF for image]

Fig. 8

Confusion matrix comparison of original data and data action recognition in FHE state a Raw Data b FHE Data

By comparing the confusion matrices of the original data and the encrypted data, it can be more intuitively seen that the encrypted skeleton data can still maintain a high accuracy of action recognition. The maximum difference between the encrypted action recognition accuracy and the original data recognition accuracy under various actions is 0.0094. At the same time, it can be seen from the figure that the recognition accuracy of the encrypted action is not necessarily lower than that of the original action. For example, the encrypted data recognition accuracy of the "pick" and "stand" actions is higher than the original data recognition results. Therefore, it can be considered that the method proposed in this paper guarantees the accuracy of skeleton action recognition to a large extent.

Communication overhead

In order to verify the effect of the multi-layer key generation system proposed in this paper, the values of parameter $λ$ are selected as 1, 2 and 3 respectively to analyze the influence of different layers of key generation on the communication overhead. The results are shown in Fig. 9.

[See PDF for image]

Fig. 9

Running time and communication overhead of the rotating key generation system with different numbers of layers

It can be seen that when $λ$ is 1, that is, using the traditional rotating key generation method, the rotating key has the largest overhead, reaching 2.6 GB, and the time required to generate the entire key is about 14.6 s. When using a 2-layer key generation system, the key overhead has been significantly reduced, about 1.8 GB. At the same time, the time consumption of generating a high-level rotating key on the key generator is only 2.2 s, and the time consumption of generating a low-level rotating key on the server is 5.5 s, and the overall key generation time is 7.7 s. Compared with traditional key generation methods, the multi-layer key system can not only significantly reduce communication overhead, but also make key generation faster. Finally, when the value of parameter $λ$ is selected to be 3, the communication overhead is further reduced on the basis of $λ$ = 2. The key generator's high-level key generation time is 2.1 s, and the server's time to generate other low-level rotating keys is 6.0 s. The overall key generation time is 8.1 s, which is slightly longer than the 2-layer key generation system. This is because on the server side, one more low-level key generation process is required than in the 2-layer system. Overall, the multi-layer key generation system can reduce the cost of the rotation key generation and transmission process to a great extent, significantly reduce the key generation time, and effectively improve the overall performance of the system.

Conclusion

This paper proposes a method for action recognition under FHE, which can use the action recognition model for reasoning under the state of skeleton data encryption while protecting user privacy and ensuring the accuracy of action recognition. Aiming at the low computational efficiency in the FHE state, a multi-channel parallel convolution scheme is designed and the effective information after convolution calculation is collected to improve the compactness of ciphertext and reduce the number of rotations. Finally, a multi-layer key generation system is constructed so that the key generation end only needs to generate high-level rotation keys instead of complete rotation keys. Experimental results show that the method proposed in this paper can greatly improve the computational efficiency of fully homomorphic convolution while ensuring the accuracy of action recognition. At the same time, the generation of multi-layer rotation keys also reduces communication overhead and time consumption, making the entire system more efficient. In future, in order to meet the high-precision requirements of action recognition in real life, we hope to apply FHE to more complex network structures to achieve more accurate and comprehensive action recognition. Furthermore, for cloud outsourcing scenarios with privacy leakage risks, users want to encrypt their data and upload it to the cloud for different task reasoning. In future, we will focus on studying the combination of FHE and different network models to achieve different reasoning functions.

Acknowledgements

This work was supported in part by National Natural Science Foundation of China (62271096, U20A20157, U22A20102), Natural Science Foundation of Chongqing China (CSTB2023NSCQ-LZX0134, CSTB2024NSCQ-LZX0034), University Innovation Research Group of Chongqing (CXQT20017), Youth Innovation Group Support Program of ICE Discipline of CQUPT (SCIE-QN-2022-04), Scientific Research Foundation of CQUPT (2023071), and the Chongqing Postdoctoral Research Program Special Grant (2023CQBSHTB3092).

Author Contributions

RuyanWang:Funding acquisition, Project administration, Writing review & editing. Qinglin Zeng: Software, Writing- original draft, Visualization. Zhigang Yang: Conceptualization, Methodology, Supervision. Puning Zhang: Supervision, Writing review & editing.

Funding

National Natural Science Foundation of China, 62271096, U20A20157, U22A20102, University Innovation Research Group of Chongqing, CXQT20017, Youth Innovation Group Support Program of ICE Discipline of CQUPT, SCIE-QN-2022-04, Scientific Research Foundation of CQUPT,2023071, Natural Science Foundation of Chongqing China, CSTB2023NSCQ-LZX0134, CSTB2024NSCQ-LZX0034, the Chongqing Postdoctoral Research Program Special Grant, 2023CQBSHTB3092

Data availability

Data will be made available on request.

Declarations

Conflict of interest

The authors declare no competing interests.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Tabrizchi, H; Kuchaki Rafsanjani, M. A survey on security challenges in cloud computing: issues, threats, and solutions. J Supercomput; 2020; 76, 12 pp. 9493-9532. [DOI: https://dx.doi.org/10.1007/s11227-020-03213-1]

2. Liu J, Zhang L (2020) Indoor privacy-preserving action recognition via partially coupled convolutional neural network. In: 2020 International Conference on Artificial Intelligence and Computer Engineering (ICAICE), IEEE, pp 292–295

3. Zhang, Z; Cilloni, T; Walter, C et al. Multi-scale, class-generic, privacy-preserving video. Electronics; 2021; 10, 10 1172. [DOI: https://dx.doi.org/10.3390/electronics10101172]

4. Wu, Z; Wang, H; Wang, Z et al. Privacy-preserving deep action recognition: an adversarial learning framework and a new dataset. IEEE Trans Pattern Anal Mach Intell; 2020; 44, 4 pp. 2126-2139. [DOI: https://dx.doi.org/10.1109/TPAMI.2020.3026709]

5. Ren, B; Liu, M; Ding, R et al. A survey on 3d skeleton-based action recognition using learning method. Cyborg Bionic Syst; 2024; 5, 0100. [DOI: https://dx.doi.org/10.34133/cbsystems.0100]

6. Moon S, Kim M, Qin Z, et al (2023) Anonymization for skeleton action recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 15028–15036

7. Carr T, Lu A, Xu D (2023) Linkage attack on skeleton-based motion visualization. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp 3758–3762

8. Acar, A; Aksu, H; Uluagac, AS et al. A survey on homomorphic encryption schemes: theory and implementation. ACM Comput Surv (Csur); 2018; 51, 4 pp. 1-35. [DOI: https://dx.doi.org/10.1145/3214303]

9. Pulido-Gaytan, B; Tchernykh, A; Cortés-Mendoza, JM et al. Privacy-preserving neural networks with homomorphic encryption: Challenges and opportunities. Peer-to Peer Network Appl; 2021; 14, 3 pp. 1666-1691. [DOI: https://dx.doi.org/10.1007/s12083-021-01076-8]

10. Sun, Z; Ke, Q; Rahmani, H et al. Human action recognition from various data modalities: A review. IEEE Trans Pattern Anal Mach Intell; 2022; 45, 3 pp. 3200-3225.

11. Cheng K, Zhang Y, He X, et al (2020) Skeleton-based action recognition with shift graph convolutional network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 183–192

12. Liu Z, Zhang H, Chen Z, et al (2020) Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 143–152

13. Shi L, Zhang Y, Cheng J, et al (2019) Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12026–12035

14. Sinha A, Chakravarty K, Bhowmick B, et al (2013) Person identification using skeleton information from kinect. In: Proc. Intl. conf. on advances in computer-human interactions, Citeseer, pp 101–108

15. Nair V, Guo W, O’Brien JF, et al (2023) Deep motion masking for secure, usable, and scalable real-time anonymization of virtual reality motion data. arXiv preprint arXiv:231105090

16. Zhang, L; Wang, X; Wang, J et al. An efficient fhe-enabled secure cloud-edge computing architecture for iomts data protection with its application to pandemic modelling. IEEE Internet Things J; 2023; 11, 9 pp. 15272-15284. [DOI: https://dx.doi.org/10.1109/JIOT.2023.3348122]

17. Hu J, Deng J, Wan W, et al (2020) Multi-party secure computing financial shared platform based on lightweight privacy protection under fhe. In: 2020 International Conference on Artificial Intelligence and Computer Engineering (ICAICE), IEEE, pp 245–249

18. Deviani, R; Nazhifah, SA; Aziz, AS. Fully homomorphic encryption for cloud based e-government data. Cyberspace: J Pendidikan Teknologi Informasi; 2022; 6, 2 pp. 105-118.

19. Agrawal R, Bu L, Ehret A, et al (2020) Fast arithmetic hardware library for rlwe-based homomorphic encryption (2020). arXiv preprint arXiv:200701648

20. Barrett P (1986) Implementing the rivest shamir and adleman public key encryption algorithm on a standard digital signal processor. In: Conference on the Theory and Application of Cryptographic Techniques, Springer, pp 311–323

21. Chen, H; Chillotti, I; Song, Y. Ishai, Y; Rijmen, V. Improved bootstrapping for approximate homomorphic encryption. Annual International Conference on the Theory and Applications of Cryptographic Techniques; 2019; Cham, Springer: pp. 34-54.

22. Chillotti, I; Gama, N; Georgieva, M et al. Takagi, T; Peyrin, T et al. Faster packed homomorphic operations and efficient circuit bootstrapping for tfhe. International Conference on the Theory and Application of Cryptology and Information Security; 2017; Cham, Springer: pp. 377-408.

23. Halevi, S; Shoup, V. Juan, AG; Rosario, G. Algorithms in helib. Advances in cryptology–CRYPTO 2014; 2014; Berlin, Springer: pp. 554-571. [DOI: https://dx.doi.org/10.1007/978-3-662-44371-2_31]

24. Lee, JW; Kang, H; Lee, Y et al. Privacy-preserving machine learning with fully homomorphic encryption for deep neural network. IEEE Access; 2022; 10, pp. 30039-30054. [DOI: https://dx.doi.org/10.1109/ACCESS.2022.3159694]

25. Kim, M; Jiang, X; Lauter, K et al. Secure human action recognition by encrypted neural network inference. Nat Commun; 2022; 13, 1 4799. [DOI: https://dx.doi.org/10.1038/s41467-022-32168-5]

26. Gilad-Bachrach R, Dowlin N, Laine K, et al (2016) Cryptonets: applying neural networks to encrypted data with high throughput and accuracy. In: International Conference on Machine Learning, PMLR, pp 201–210

27. Chou E, Beal J, Levy D, et al (2018) Faster cryptonets: leveraging sparsity for real-world encrypted inference. arXiv preprint arXiv:181109953

28. Brutzkus A, Gilad-Bachrach R, Elisha O (2019) Low latency privacy preserving inference. In: International Conference on Machine Learning, PMLR, pp 812–821

29. Cheon, JH; Han, K; Kim, A et al. Carlos, C; Michael, JJ et al. A full rns variant of approximate homomorphic encryption. Selected areas in cryptography–SAC 2018: 25th International Conference, Calgary, AB, Canada, August 15–17, 2018 Revised Selected Papers; 2019; Cham, Springer: pp. 347-368. [DOI: https://dx.doi.org/10.1007/978-3-030-10970-7_16]

30. Cheon JH, Kang M, Kim T, et al (2023) High-throughput deep convolutional neural networks on fully homomorphic encryption using channel-by-channel packing. Cryptology ePrint Archive, pp 2023/632

31. Juvekar C, Vaikuntanathan V, Chandrakasan A (2018) {GAZELLE}: A low latency framework for secure neural network inference. In: 27th USENIX security symposium (USENIX security 18), pp 1651–1669

32. Li, B; Micciancio, D. Anne, C; François-Xavier, S. On the security of homomorphic encryption on approximate numbers. Annual International Conference on the Theory and Applications of Cryptographic Techniques; 2021; Cham, Springer: pp. 648-677.

33. Lyubashevsky V, Peikert C, Regev O (2010) On ideal lattices and learning with errors over rings. In: Advances in Cryptology–EUROCRYPT 2010: 29th Annual International Conference on the Theory and Applications of Cryptographic Techniques, French Riviera, May 30–June 3, 2010. Proceedings 29, Springer, pp 1–23

34. Kuehne H, Jhuang H, Garrote E, et al (2011) Hmdb: a large video database for human motion recognition. In: 2011 International Conference on Computer Vision, IEEE, pp 2556–2563

35. Kwolek, B; Kepski, M. Human fall detection on embedded platform using depth maps and wireless accelerometer. Comput Methods Programs Biomed; 2014; 117, 3 pp. 489-501. [DOI: https://dx.doi.org/10.1016/j.cmpb.2014.09.005]

36. SEAL (2019) Microsoft SEAL (release 3.4). https://github.com/Microsoft/SEAL, microsoft Research, Redmond, WA.

37. Sun K, Xiao B, Liu D, et al (2019) Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 5693–5703

Word count: 9051

Show less

Cloud-based secure human action recognition with fully homomorphic encryption

Content area

Abstract

Full text