Abstract
In real-world scenarios, the recognition of unknown activities poses a significant challenge for group activity recognition. Existing methods primarily focus on closed sets, leaving the task of open set group activity recognition unexplored. In this paper, we introduce the concept of open set group activity recognition for the first time and propose a novel recognition framework to deal with it. To mitigate potential scene biases, keypoints extracted from groups are utilized as input. Our framework employs a two-stage approach: Evidence Aware Collection and Evidence Aware Decision, to address the challenge of insufficient evidence for rejecting unknown classes. Specifically, encoders are established at the individual, subgroup, and group scales to collect activity evidence among group members. By applying an attention mechanism, we focus on important evidence, resulting in a set of aggregated evidence. The uncertainty estimated from evidence is then used to effectively distinguish between known and unknown classes. Additionally, we perform open set splits on two publicly available group activity recognition datasets. Experimental results demonstrate that our method shows promising performance in open set group activity recognition while maintaining comparable performance under closed set conditions.
Introduction
Group activity recognition aims to identify predefined categories of human activities in group settings. Human activity refers to actions or behaviors performed by individuals or groups of people based on video data. These activities can range from simple actions like walking and sitting to more complex interactions, such as group discussions or team sports. Thus, it finds applications in diverse fields, such as sports events and public safety. However, the limitations of predefined categories become evident during the training process, as they often fail to encompass all types of activities encountered in real-world scenarios. For instance, the CAD dataset [1] in street scenes exemplifies this constraint by defining only five group activity categories: crossing, walking, waiting, talking, and queueing. Such predefined categories inadequately capture the complexity of real-life street situations.
Despite the substantial progress achieved by existing group activity recognition models [2, 3, 4, 5, 6, 7, 8–9], a persistent challenge remains in identifying activities that are not part of the training data. Figure 1 illustrates the disparity between open set and closed set scenarios in group recognition tasks, underscoring how models trained in closed set environments struggle to adapt to real-life applications. In a closed set scenario, the model is trained on a fixed set of categories and assumes that any encountered activity belongs to one of these predefined categories. Conversely, in an open set scenario, the model must be capable of recognizing activities that were not present in the training data, treating them as novel or unknown categories. Consequently, there is a critical need for a group activity recognition model that operates in an open set environment and acknowledges the inherent limitation of traditional methods in addressing unknown categories. In this context, open set group activity recognition is defined as the computational task of identifying and categorizing collective behaviors within a group, specifically addressing the challenge posed by unknown activities not predefined during the training process. This novel setting aims to enhance the adaptability of group activity recognition models in dynamic and diverse real-world environments.
In recent years, traditional open set recognition methods have been applied to action recognition. Shu et al. [10] proposed ODN, a method that incrementally updates deep networks to recognize new classes. Yoon et al. [11] introduced STRM, which uses spatiotemporal feature similarity to compare against and update an action gallery for recognizing new classes. However, these methods require a continuous influx of new categories, leading to changes in network structure or increased storage costs. Additionally, Bao et al. [12] and Zhao et al. [13] adopted an evidential learning approach, mapping extracted features to a Dirichlet distribution to calculate uncertainty. However, this approach relies solely on global features to measure classification evidence and does not comprehensively understand human activity from different views. Therefore, considering the practical significance of group activity recognition and the uncertainty of open set spaces, our goal in open set group activity recognition is to distinguish known classes from unknown ones; we do not further categorize the new classes.
Fig. 1 [Images not available. See PDF.]
A comparison of group activity recognition under closed set and open set conditions. (a) illustrates the closed set situation, where the group activity categories in the test set are a subset of those in the training set; all testing categories are known to the recognition model. (b) depicts the open set scenario, where the training set only covers a subset of the entire space of group activities; as a result, the testing phase may introduce new classes that are unknown to the model
In this paper, we extend the group activity recognition task from a closed set to an open set environment and propose an Open Set Group Activity Recognition (OSGAR) method. To enable the model to distinguish between known and unknown classes, we follow the evidential learning approach: we calculate the evidence of the input video belonging to each group activity category using a Dirichlet distribution and directly output the uncertainty without complex post-processing. To enlarge the evidence gap between known and unknown classes, we design a two-stage method that leverages the characteristics of group activities. Firstly, we embed the skeleton features and input them into the Evidence Aware Collection stage, which comprehends the activity structure in the video at the individual, subgroup, and group scales. By decomposing the group activity, we improve the focus and coordination of the network at each scale. Secondly, the Evidence Aware Decision module fuses the three sets of evidence with an attention mechanism and infers a more comprehensive uncertainty result. This ensures that the rejection of unknown classes is grounded in contributions from every view of the group. Furthermore, existing public datasets for group activity recognition do not meet the requirements of training and testing models in an open set environment. Therefore, we conduct open set splits on two datasets based on activity categories, defining known and unknown classes according to certain proportions.
The contributions of this paper are summarized as follows:
We propose a novel framework for group activity recognition that is capable of identifying known and unknown classes through highly discriminative uncertainty, thus catering to an open set environment. To the best of our knowledge, this is the first research to recognize unknown group activity categories.
We propose a two-stage method, Evidence Aware Collection (EAC) and Evidence Aware Decision (EAD), that effectively addresses the issue of insufficient evidence in the group activity recognition task and enhances the reliability of uncertainty estimation.
We conduct experiments on public datasets by partitioning them into open set scenarios, and the results demonstrate that our method not only achieves promising recognition performance in open set environments but also maintains high accuracy in closed set recognition.
Related work
Group activity recognition
In recent years, researchers in the field of group activity recognition have made significant progress. Early methods typically involved extracting hand-crafted features and then employing probabilistic graphical models [14, 15, 16–17] or an AND-OR grammar approach [18, 19]. With the advancement of deep learning, approaches combining convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been widely applied in group activity recognition [20, 21, 22, 23, 24, 25, 26–27]. Specifically, CNNs are used to extract static features at a person-level, while RNNs are employed to capture the temporal dynamics of each individual and model the interaction context. For instance, Wang et al. [23] developed a hierarchical recurrent framework to handle higher order contexts, including individual dynamics as well as interactions within and between groups. Tang et al. [24] proposed a teacher-student network that incorporates semantic information into the visual domain, enabling better identification of key individuals. Furthermore, inspired by the success of Graph Neural Networks (GNNs), some attempts have been made to explore the interactions in group activity recognition using similar ideas [3, 7, 28, 29, 30, 31, 32, 33, 34, 35–36]. Wu et al.[3] proposed a flexible Actor Relation Graph (ARG) to capture the appearance and spatial relationships between actors, providing an interpretable mechanism for explicitly modeling the interactions among participants. Wang et al. [4] constructed 3D-unified spatial-temporal graphs to capture the variations in individual postures from different perspectives. This approach effectively addresses issues caused by occlusions and varying camera angles, overcoming limitations of 2D methods. Recently, the Transformer [37] has shown promising results in various computer vision domains, and researchers have also attempted to apply it to the task of group activity recognition [5, 6–7, 38, 39, 40–41]. Han et al. [42] proposed a DualAI framework that can flexibly arrange spatiotemporal Transformers in two complementary sequences. Zhou et al. [8] designed a Composer for compositional reasoning of group activities, optimizing the multi-scale representations learned by the Transformer. Furthermore, Zhu et al. [43] introduced MLST-Former, which explores the spatiotemporal dependencies between different actors from multiple perspectives and selectively aggregates them across frames. Considering the issue of annotation cost in practical applications, research in group activity recognition has gradually shifted toward weakly supervised or unsupervised methods. Kim et al. [6] proposed the first weakly supervised group activity recognition model, combining multiple learnable tokens to form cues for group behavior. Du et al. [9] explored local relative motion and deep contextual information to reduce reliance on labels. They further designed a spatiotemporal contrastive loss [44] to extract discriminative group behavior features, achieving self-supervised learning. However, the aforementioned methods all assume a fixed number of target group activity classes in their experiments, which is not suitable for real-world applications. Therefore, there is a need to develop a group activity recognition model that can operate in an open set environment and possess the capability to distinguish between known and unknown classes.
Open set recognition
The open set problem was first introduced by Li and Wechsler [45] for face recognition tasks. In order to reject unknown classes, Scheirer et al. [46] proposed a binary support vector machine (SVM), and building on this method and extreme value theory, P-SVM [47] was introduced to calibrate class confidence scores. With the success of deep learning, Deep Neural Networks (DNNs) have been widely applied in Open Set Recognition (OSR). Bendale et al. [48] proposed OpenMax, which estimates the probability of inputs belonging to unknown classes. Adversarial learning methods have also been introduced in OSR [49, 50]. Yang et al. [51] proposed an adversarial learning approach based on negative sample features and then used paired Weibull distributions to identify unknown classes. Additionally, methods that utilize reconstruction errors for more effective rejection of unknown classes have been proposed [52, 53–54]. Beyond image classification, open set recognition methods have also been developed for action recognition. To deal with unknown behavior classes in real-world scenarios, Roitberg et al. [55] proposed a voting-based approach that uses the estimated uncertainty of action predictions to measure the novelty of test samples. DEAR [12] addresses Open Set Action Recognition (OSAR) by estimating the uncertainty of actions to differentiate between known and unknown samples. MULE [13] introduced Beta-ENN to better fit distributions with multiple behavior classes. However, the aforementioned methods have mainly been proposed for open set image classification and simple action recognition tasks. For open set group activity recognition, which is more challenging than the OSR and OSAR problems, there has been no existing research. Therefore, inspired by OSR and OSAR methods, we propose a solution to this problem for the first time.
Uncertainty estimation
To distinguish between known and unknown classes, most methods utilize the uncertainty of sample predictions and determine their membership based on a predefined threshold. In light of this, Kendall et al. [56] proposed Bayesian Neural Networks (BNNs) to estimate uncertainty by approximating the moments of the posterior predictive distribution. However, considering the computational cost of sampling in BNNs during inference and the difficulty of obtaining the posterior distribution, Evidential Neural Networks (ENNs) [57] were introduced. ENNs place Dirichlet distributions on predicted class probabilities and collect evidence for making such predictions through neural networks. Nevertheless, while this approach is applicable in simple image classification tasks, when faced with the semantic complexity of group activity recognition, solely collecting evidence from final prediction results leads to insufficient evidence.
Fig. 2 [Images not available. See PDF.]
An overview of our proposed method. Our OSGAR method begins by extracting the human skeleton using the pose estimation network HRNet [58]. The embedded keypoint features are then passed to a two-stage network. In the first stage, called Evidence Aware Collection, three Transformer encoders are employed to collect evidence features at three different scales. In the second stage, called Evidence Aware Decision, the evidence values obtained in the previous stage are aggregated, and the sample's uncertainty is calculated to determine its membership in either the known or the unknown class. In the final stage, the model retains its ability for closed set multi-class classification and can also provide the specific class for known samples
Method
Our proposed OSGAR method, as shown in Fig. 2, takes video samples as input and estimates their uncertainty to determine the familiarity of the group activities in the videos. Specifically, OSGAR only utilizes embedded skeleton features as input and conducts a two-stage process. In the first stage, called Evidence Aware Collection (EAC), different semantic features are extracted at three scales to generate three sets of evidence values. Then, in the second stage, the Evidence Aware Decision (EAD) module selectively aggregates the evidence based on an attention mechanism to obtain the final uncertainty result, which can effectively distinguish between known and unknown classes.
Keypoint feature embed
RGB-based recognition methods often carry background noise [12, 59, 60], which can mislead open set judgment. Therefore, we introduce skeleton data, which focuses on human motion, eliminates background bias, and makes the computation more lightweight [8, 61, 62–63]. Additionally, compared to skeleton-only input [8, 61, 64], the dual input of RGB and skeleton [4, 63, 65] not only increases computational complexity but also does not significantly improve accuracy, and can even decrease it [8]. Considering the demand for model generalization in open set scenarios, we ultimately choose skeleton features as the sole input to our model. The initial skeleton information is extracted using the mature pose estimation network HRNet [58], which provides the positions and types of human joints in video frames. To further enrich the input features, we embed latent prior information, such as joint type, temporal position, and spatial position, into the original skeleton. This results in a joint vector for each individual. The process can be formalized as follows:
$$x_{i,j}^{t} = \mathrm{FFN}\left(\left[\,k_{i,j}^{t} \,\|\, e_{j}^{type} \,\|\, e_{i}^{space} \,\|\, e_{t}^{time}\,\right]\right) \tag{1}$$
where $k_{i,j}^{t}$ represents the keypoint of the $i$-th individual, joint type $j$, in the $t$-th frame. $\mathrm{FFN}(\cdot)$ represents a Feed Forward Network, and $[\,\cdot\,\|\,\cdot\,]$ is the concatenation function. $e_{j}^{type}$ is the learned vector representation of joint type $j$ obtained through a GCN, $e_{i}^{space}$ is the normalized and standardized spatial position encoding for the $i$-th individual, and $e_{t}^{time}$ is the vector obtained by the learned absolute positional encoding for time $t$.
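To make the embedding step concrete, the following PyTorch-style sketch implements a module in the spirit of Eq. (1). The module name KeypointEmbedding, the dimensions, and the use of plain nn.Embedding layers in place of the GCN-learned joint-type vectors are our assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class KeypointEmbedding(nn.Module):
    """Minimal sketch of Eq. (1); names and dimensions are illustrative assumptions."""
    def __init__(self, num_joint_types=17, max_frames=10, dim=256):
        super().__init__()
        self.type_emb = nn.Embedding(num_joint_types, dim)   # stand-in for the GCN-learned joint-type vectors
        self.time_emb = nn.Embedding(max_frames, dim)        # learned absolute positional encoding over time
        self.space_proj = nn.Linear(2, dim)                  # normalized (x, y) spatial position of the person
        # FFN over the concatenation of raw keypoint coordinates and the three encodings
        self.ffn = nn.Sequential(nn.Linear(2 + 3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, kpt_xy, joint_type, person_xy, t):
        # kpt_xy: (..., 2) joint coordinates; joint_type, t: integer indices; person_xy: (..., 2)
        feats = torch.cat([kpt_xy,
                           self.type_emb(joint_type),
                           self.space_proj(person_xy),
                           self.time_emb(t)], dim=-1)
        return self.ffn(feats)
```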
Evidence aware collection
Since the evidence obtained from feature inference directly affects the estimation of sample uncertainty, it is crucial to collect as much potentially helpful evidence as possible. Traditional evidential learning methods rely only on the final output of the feature extractor. However, group activities may contain highly similar classes, such as "pass" and "set" in volleyball. If one of these classes is unknown, traditional methods can easily confuse known and unknown classes. Since these methods overlook the potential evidence present at different stages of the network and at different semantic levels, we design a two-stage evidence processing approach. In the first stage, considering the multi-scale nature of group activities, the group activity is decomposed into three scales: individual, subgroup, and group. The Evidence Aware Collection stage is then employed to gather evidence features from each scale. The multiple pieces of evidence obtained from these features embed more characteristics related to the composition and interaction of group activities, thus increasing the reliability of the final classification.
In the task of group activity recognition, individuals serve as the smallest units of analysis. The key actions of certain individuals can directly impact the final results. Therefore, while gradually combining individual behaviors to construct the exhibited group activity in a bottom-up manner, it is also necessary to focus on individual-level features and collect evidence to assist in determining sample uncertainty. Firstly, we need to combine the embedded skeleton vectors to form individual features. This process can be formalized as:
$$p_{i}^{t} = \left[\,x_{i,1}^{t} \,\|\, x_{i,2}^{t} \,\|\, \cdots \,\|\, x_{i,J}^{t}\,\right] \tag{2}$$
where $p_{i}^{t}$ denotes the feature representation of the $i$-th individual in the $t$-th frame, $[\,\cdot\,\|\,\cdot\,]$ is the concatenation function, and $J$ is the total number of joint types. Next, we refine the spatial evolution clues of individuals by employing a Transformer with the time dimension as the batch dimension. Given the individual features, denoted as $P \in \mathbb{R}^{T \times N \times D}$, this process can be formalized as:

$$F_{I} = \mathrm{TE}_{I}(P) \tag{3}$$
where $N$ is the total number of individuals, $T$ represents the total number of frames, $D$ is the channel dimension, and $F_{I}$ is the refined individual-level feature. $\mathrm{TE}_{I}$ is the Individual-wise Transformer encoder layer, with a structure similar to that of the Transformer encoder [37]. In this way, temporal and contextual dependencies across individuals and frames are effectively modeled.
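A minimal sketch of Eq. (3), assuming a standard nn.TransformerEncoder and treating the time dimension as the batch dimension so that self-attention runs across the individuals within each frame. The tensor shapes and encoder hyperparameters follow the values reported in the implementation details but are otherwise illustrative.

```python
import torch
import torch.nn as nn

# Sketch of Eq. (3): refine individual features with a Transformer encoder,
# using the time dimension as the batch dimension (shapes are assumptions).
T, N, D = 10, 12, 256                     # frames, individuals, channel dimension
P = torch.randn(T, N, D)                  # individual features p_i^t stacked over frames

enc_layer = nn.TransformerEncoderLayer(d_model=D, nhead=2, dim_feedforward=1024,
                                        dropout=0.2, batch_first=True)
individual_encoder = nn.TransformerEncoder(enc_layer, num_layers=1)

# With batch_first=True the first axis acts as the batch, so each frame is processed
# independently and self-attention runs across the N individuals within that frame.
F_I = individual_encoder(P)               # (T, N, D) refined individual-level features
```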
Regarding the analysis of group activity, several researchers [66, 67–68] have found that complex group activities can be differentiated into multiple subgroups for better understanding. By exploring the connections between these subgroups, we can infer the activity category of the entire group. Therefore, evidence at the subgroup level plays a crucial role in identifying unknown activity categories. Based on this, it is necessary to incorporate subgroup-level evidence, in addition to individual-level and overall evidence, to provide complementary information. The strategy for dividing subgroups is critical. Considering computational costs, a subgroup division strategy grounded in inter-individual similarity has been selected. Initially, the similarity matrix between individuals is computed through matrix multiplication, utilizing the previously acquired individual features. Then, we employ k-means [69] based on the similarity matrix to divide the individuals into $M$ subgroups. This process can be formalized as:

$$S = \Psi\!\left(F_{I} F_{I}^{\top}\right) \tag{4}$$
$$G_{m} = W_{s} \sum_{i=1}^{N} a_{i,m}\, F_{I}^{(i)} \tag{5}$$
where $S$ represents the similarity matrix between individuals and $G_{m}$ denotes the features of the $m$-th subgroup. $(\cdot)^{\top}$ signifies the transpose operation on a matrix, while $\Psi(\cdot)$ indicates setting the diagonal elements of a matrix to zero to eliminate the influence of high self-similarity on the subsequent clustering step. $a_{i,m}$ represents whether the $i$-th individual belongs to the $m$-th subgroup, which is obtained through k-means clustering based on $S$, $F_{I}^{(i)}$ is the refined feature of the $i$-th individual, and $W_{s}$ represents the learnable weights of the linear projection. Similarly to the individual-level features, the spatiotemporal information between subgroups is captured utilizing a Transformer encoder. This process can be formalized as:

$$F_{S} = \mathrm{TE}_{S}\!\left(\left[\,G_{1}, G_{2}, \ldots, G_{M}\,\right]\right) \tag{6}$$
where $F_{S}$ represents the feature evidence of the refined subgroups, and $\mathrm{TE}_{S}$ refers to the Subgroup-wise Transformer encoder, which has the same structure as the Transformer encoder [37]. $M$ denotes the number of subgroups and also of the clustering centers.
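The similarity-based division of Eqs. (4)-(5) could be sketched as follows, assuming scikit-learn's KMeans for the clustering step; the function name, the pooling of time for brevity, and the plain sum aggregation (before the learnable projection) are our own simplifications.

```python
import torch
from sklearn.cluster import KMeans

def divide_subgroups(F_I, num_subgroups=2):
    """Sketch of Eqs. (4)-(5): similarity-based subgroup division (simplified).

    F_I: (N, D) refined individual features for one clip (time already pooled for brevity).
    Returns subgroup features of shape (num_subgroups, D) before the learnable projection W_s.
    """
    # Eq. (4): similarity matrix via matrix multiplication, diagonal zeroed
    S = F_I @ F_I.T                                   # (N, N)
    S.fill_diagonal_(0.0)                             # remove the high self-similarity

    # k-means on the rows of the similarity matrix assigns each individual to a subgroup
    labels = KMeans(n_clusters=num_subgroups, n_init=10).fit_predict(S.detach().cpu().numpy())
    labels = torch.tensor(labels)

    # Eq. (5): aggregate the members of each subgroup (a learnable layer W_s would follow)
    groups = [F_I[labels == m].sum(dim=0) for m in range(num_subgroups)]
    return torch.stack(groups)                        # (num_subgroups, D)
```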
Finally, we combine the individual features and subgroup features to generate a global representation of the group's characteristics. In a parallel manner, a Transformer encoder [37] is employed to capture the underlying correlations at both scales. This process can be formalized as:

$$F_{G} = \mathrm{TE}_{G}\!\left(W_{g}\left[\,F_{I} \,\|\, F_{S}\,\right]\right) \tag{7}$$
Here, $F_{G}$ represents the group features, and $\mathrm{TE}_{G}$ refers to the Group-wise Transformer encoder, whose structure remains consistent with the Transformer encoder [37]. $W_{g}$ represents the learnable weights. The Transformer encoder structures at the three scales all include multi-head self-attention mechanisms. By utilizing multiple attention heads, they can capture the correlation representations between different individuals within the same scale. This enables them to exhibit better generalization ability when faced with open conditions.
Evidence aware decision
In closed set recognition, the model is trained to recognize a predetermined set of classes denoted as $\mathcal{C}_{known} = \{c_{1}, c_{2}, \ldots, c_{C}\}$. The classifier function $f(x)$ is designed to assign an input $x$ to one of these predefined classes according to the following mathematical representation:
$$f(x) = \hat{y}, \quad \hat{y} \in \mathcal{C}_{known} \tag{8}$$
Conversely, open set recognition addresses scenarios where the model encounters classes during testing that were not present during training. In this context, the model must distinguish between known classes (the closed set) and unknown classes. Let $\mathcal{C}_{known}$ denote the set of known classes, and $\mathcal{C}_{unknown}$ represent the set of unknown or novel classes. The classification function $f(x)$ accommodates this open set scenario through the following formulation:

$$f(x) = \begin{cases} \hat{y} \in \mathcal{C}_{known}, & \text{if } x \text{ belongs to a known class} \\ \text{unknown}, & \text{otherwise} \end{cases} \tag{9}$$
where $f(x)$ assigns the input $x$ to one of the known classes if it belongs to $\mathcal{C}_{known}$; otherwise, it labels the input as unknown. This open set decision theory is equally applicable to the task of group activity recognition. Furthermore, we leverage evidential learning to improve the accuracy of our decision-making.
In traditional evidential learning, only the global features from the last layer are used to infer sample evidence. However, in complex group activities, global features alone are insufficient to distinguish known classes from unknown classes. Insufficient evidence often leads to incorrect results when unknown classes exhibit high similarity to known classes during testing. To address this issue, we propose a novel evidence decision-making method that utilizes an attention mechanism to selectively focus on the various evidence features collected in the first stage. The method then computes a discriminative sample uncertainty, effectively rejecting unknown classes.
Firstly, the evidence for the three sets of collected features is computed separately. Taking the individual features as an example, this process can be formalized as:
$$e_{I} = g\!\left(h_{\theta}(F_{I})\right), \quad e_{I} \in \mathbb{R}^{C} \tag{10}$$
where $e_{I}$ represents the evidence value for the individual-level features, $g(\cdot)$ is an activation function such as softplus or ReLU, $h_{\theta}$ refers to the deep neural network parameterized by $\theta$, and $C$ represents the number of known group activity classes. The subgroup evidence $e_{S}$ and the group evidence $e_{G}$ are calculated in the same way. To reinforce the auxiliary effect between evidence values at different scales and learn to focus on the more important parts of the evidence, we propose a dot product attention mechanism applied to the evidence values. Compared to other attention mechanisms, it possesses lower computational complexity, ensuring efficiency in open set environments. Additionally, the dot product operation directly measures the similarity between multiple pieces of evidence, better preserving the important features in the evidence values. This process can be formalized as:

$$\alpha_{k} = \mathrm{softmax}\!\left(\phi_{q}(e_{G})^{\top}\, \phi_{k}(e_{k})\right), \quad k \in \{I, S, G\} \tag{11}$$
$$e_{F} = g\!\left(\gamma \sum_{k \in \{I, S, G\}} \alpha_{k}\, \phi_{v}(e_{k}) + e_{G}\right) \tag{12}$$
where $e_{k}$ represents the different evidence values, $\phi_{q}$, $\phi_{k}$, $\phi_{v}$ are three $1 \times 1$ convolutional layers with $C$ filters, and $\alpha_{k}$ is the attention weight which indicates the contribution of the evidence $e_{k}$ to the group evidence $e_{G}$. Subsequently, the output of the attended representation is multiplied by a scale parameter $\gamma$, and the initial group evidence is added back. Following this, the final evidence $e_{F}$ is derived by applying a nonlinear activation function to this sum.
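A hedged sketch of Eqs. (10)-(12): per-scale evidence heads followed by dot-product attention fusion. The module name EvidenceFusion, the choice of nn.Conv1d for the 1 × 1 layers, the pooled input features, and the class count (six known Volleyball classes in the V1 split) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidenceFusion(nn.Module):
    """Sketch of Eqs. (10)-(12); layer names and shapes are our own assumptions."""
    def __init__(self, dim=256, num_classes=6):
        super().__init__()
        # Eq. (10): one evidence head per scale, made non-negative with softplus
        self.heads = nn.ModuleDict({s: nn.Linear(dim, num_classes) for s in ("I", "S", "G")})
        # Eqs. (11)-(12): 1x1 convolutions acting on the C-dimensional evidence vectors
        self.phi_q = nn.Conv1d(1, 1, kernel_size=1)
        self.phi_k = nn.Conv1d(1, 1, kernel_size=1)
        self.phi_v = nn.Conv1d(1, 1, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))      # learnable scale on the attended evidence

    def forward(self, feats):                          # feats: dict of (B, dim) pooled features per scale
        scales = ("I", "S", "G")
        ev = {s: F.softplus(self.heads[s](feats[s])) for s in scales}                 # Eq. (10): (B, C)
        q = self.phi_q(ev["G"].unsqueeze(1))                                          # (B, 1, C) query
        keys = torch.stack([self.phi_k(ev[s].unsqueeze(1)) for s in scales], dim=1)   # (B, 3, 1, C)
        vals = torch.stack([self.phi_v(ev[s].unsqueeze(1)) for s in scales], dim=1)   # (B, 3, 1, C)
        scores = (q.unsqueeze(1) * keys).sum(dim=-1)                                  # (B, 3, 1) dot products
        alpha = torch.softmax(scores, dim=1)                                          # Eq. (11): weights over scales
        attended = (alpha.unsqueeze(-1) * vals).sum(dim=1).squeeze(1)                 # (B, C)
        e_final = F.softplus(self.gamma * attended + ev["G"])                         # Eq. (12)
        return ev, e_final
```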
In evidential learning, based on the Dempster-Shafer Theory (DST) [70] and subjective logic (SL) [71], when considering a sample for $C$-class classification, it is assumed that the class probability follows a prior Dirichlet distribution parameterized by $\boldsymbol{\alpha}$. The value of $\boldsymbol{\alpha}$ is directly connected to the learned evidence $e$, and we can calculate the sample's uncertainty from this distribution. This process can be formalized as:

$$\alpha_{c} = e_{c} + 1, \quad c = 1, \ldots, C \tag{13}$$
$$u = \frac{C}{\sum_{c=1}^{C} \alpha_{c}} \tag{14}$$
where $u$ represents the uncertainty of the sample, $\alpha_{c}$ is the parameter of the Dirichlet distribution for class $c$, and $C$ is the total number of known group activity categories. Upon acquiring the final uncertainty of a sample, it is compared against a predefined threshold. If the uncertainty is greater than the threshold, we classify the sample as an unknown activity; otherwise, it is considered a known activity and its specific class is then given. In determining the threshold, we refer to the method proposed in [12, 13, 72], which ensures that 95% of the training data is recognized as known.
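The uncertainty estimate and the rejection rule of Eqs. (13)-(14) reduce to a few lines. The helper below is a sketch with an assumed function name; the threshold itself would be chosen so that 95% of training samples are accepted as known, as described above.

```python
import torch

def open_set_decision(e_final, threshold):
    """Sketch of Eqs. (13)-(14) and the rejection rule (function name is ours).

    e_final: (B, C) non-negative evidence from the EAD stage.
    Returns predicted class indices, with -1 marking samples rejected as unknown.
    """
    alpha = e_final + 1.0                              # Eq. (13): Dirichlet parameters
    C = e_final.shape[-1]
    u = C / alpha.sum(dim=-1)                          # Eq. (14): uncertainty in (0, 1]
    pred = alpha.argmax(dim=-1)                        # closed set prediction for accepted samples
    pred[u > threshold] = -1                           # reject as unknown when uncertainty is too high
    return pred, u
```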
To simultaneously address multi-class classification and uncertainty modeling, we employ the EDL loss [12], which is defined as follows:

$$\mathcal{L}_{EDL}(e, y) = \sum_{c=1}^{C} y_{c}\left(\log S - \log \alpha_{c}\right) \tag{15}$$
where $y$ represents the one-hot label vector of dimension $C$, $e$ represents the evidence values of dimension $C$, and $S$ represents the total strength of the Dirichlet distribution, computed as $S = \sum_{c=1}^{C} \alpha_{c}$. The loss function increases the evidence values for the correct class by calculating cross-entropy and suppresses the evidence for incorrect classes. In the evidence decision stage, the multiple sets of generated evidence values are optimized using the EDL loss. Depending on the use of the attention mechanism, this process can be categorized into two types of losses. The formalization of this process is as follows:

$$\mathcal{L}_{EAC} = \sum_{k \in \{I, S, G\}} \mathcal{L}_{EDL}(e_{k}, y) \tag{16}$$
$$\mathcal{L}_{EAD} = \mathcal{L}_{EDL}(e_{F}, y) \tag{17}$$
where $\mathcal{L}_{EAC}$ is used to constrain the initial evidence computed at the three different scales, while $\mathcal{L}_{EAD}$ is used to constrain the final evidence generated through the attention mechanism. Additionally, to ensure the model's closed set reasoning capability, the global group feature output by the evidence collection stage is utilized to generate class probabilities of dimension $C$. Then, the closed set accuracy loss is computed by cross-entropy. This process can be formalized as follows:

$$\mathcal{L}_{cls} = -\sum_{c=1}^{C} y_{c} \log \hat{y}_{c} \tag{18}$$
where $y$ represents the labels and $\hat{y}$ represents the closed set predictions. The total loss of our OSGAR model can be defined as follows:

$$\mathcal{L} = \lambda_{1} \mathcal{L}_{EAC} + \lambda_{2} \mathcal{L}_{EAD} + \lambda_{3} \mathcal{L}_{cls} \tag{19}$$
where $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$ serve as hyperparameters.
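For illustration, here is a sketch of the training objective in Eqs. (15)-(19). It assumes the log-form (type-II maximum likelihood) variant of the EDL loss and reuses the loss weights reported later in the implementation details; the function names are ours.

```python
import torch
import torch.nn.functional as F

def edl_loss(evidence, target, num_classes):
    """Eq. (15), assuming the log-form (type-II ML) variant of the EDL loss."""
    alpha = evidence + 1.0                                   # Dirichlet parameters
    S = alpha.sum(dim=-1, keepdim=True)
    y = F.one_hot(target, num_classes).float()
    return (y * (torch.log(S) - torch.log(alpha))).sum(dim=-1).mean()

def total_loss(ev, e_final, logits, target, num_classes, lambdas=(0.1, 0.5, 1.0)):
    """Eqs. (16)-(19); the lambda values follow the implementation details."""
    l_eac = sum(edl_loss(ev[s], target, num_classes) for s in ("I", "S", "G"))   # Eq. (16)
    l_ead = edl_loss(e_final, target, num_classes)                               # Eq. (17)
    l_cls = F.cross_entropy(logits, target)                                      # Eq. (18)
    l1, l2, l3 = lambdas
    return l1 * l_eac + l2 * l_ead + l3 * l_cls                                  # Eq. (19)
```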
Experiments
Datasets
We conduct experiments on two datasets that are widely used for group activity recognition: the Volleyball dataset [73] and the Collective Activity dataset (CAD) [1]. The Volleyball dataset consists of 55 videos of volleyball games, divided into two groups: 39 videos are used for training, yielding 3,493 clips, and the remaining 16 videos form the test set with 1,337 clips. There are 8 group activity labels (right set, right spike, right pass, right winpoint, left set, left spike, left pass and left winpoint). Each clip contains 41 frames, and only the frames in the middle part provide the complete group activity category, actor action categories, and individual bounding boxes. In [20], labels and bounding boxes are added to the remaining frames, and we conduct our experiments based on this extended annotation. The CAD [1] consists of 44 video sequences. There are 5 group activity labels (crossing, waiting, queueing, walking and talking). The activity category of a group is determined by the majority of its actors' action categories. We use the same evaluation as in stagNet [25] and choose 1/3 of the videos for testing and the rest as the training set. Overall, the detailed information of these two datasets is presented in Table 1. However, to evaluate the model's performance in an open set environment, we adjust the dataset splits to create Open Set Group Activity Recognition (OSGAR) datasets. The specific details of these split strategies are presented in the following section.
Table 1. Dataset detailed information
| Dataset | Number of videos | Number of clips | Frames per clip | Number of activity classes | Activity speed |
|---|---|---|---|---|---|
| Volleyball | 55 | 4,830 | 41 | 8 | fast |
| CAD | 44 | 2,511 | 10 | 5 | slow |
Table 2. Dataset open set split method
| Split method | Dataset | Known classes ($\mathcal{C}_{known}$) | Unknown classes ($\mathcal{C}_{unknown}$) |
|---|---|---|---|
| V1-openset | Volleyball | l-winpoint, r-winpoint, l-spike, r-spike, l-pass, r-pass | l-set, r-set |
| V1-openset | CAD | crossing, waiting, talking | walking, queueing |
| V2-openset | Volleyball | l-spike, r-spike, l-pass, r-pass, l-set, r-set | l-winpoint, r-winpoint |
Open set settings
Datasets split strategy
To build datasets suitable for open set evaluation, we adopt a split strategy similar to the one used in [13]. The videos are divided into two groups, $\mathcal{C}_{known}$ and $\mathcal{C}_{unknown}$, each representing different classes: the activities within $\mathcal{C}_{known}$ correspond to the known classes, while those within $\mathcal{C}_{unknown}$ represent the unknown classes. We use activities from $\mathcal{C}_{known}$ for training and activities from the combined set $\mathcal{C}_{known} \cup \mathcal{C}_{unknown}$ for testing. For instance, the Volleyball dataset contains four pairs of activity classes, each consisting of a left-side and a right-side variant. Among these, three pairs are set as $\mathcal{C}_{known}$ and one pair is set as $\mathcal{C}_{unknown}$. In the CAD dataset, three of the five activity classes are defined as $\mathcal{C}_{known}$, while the remaining two are defined as $\mathcal{C}_{unknown}$. Specifically, we employ two different split methods, as shown in Table 2. In V1-openset, emphasis is placed on activities characterized by high similarity. For instance, in the Volleyball dataset, "pass" and "set" have high similarity, making them difficult to distinguish even in closed set recognition. Therefore, in this open set split, we define "set" as an unknown class and "pass" as a known class. The same principle applies to the CAD dataset. On the other hand, V2-openset is based on a larger discrepancy between known and unknown classes. In the Volleyball dataset, the "winpoint" activity, which is distinctly different from the others, is defined as an unknown class. In CAD, since the existing five classes all share some similarities, we do not further split them. Therefore, experiments based on the V1-openset method evaluate the model's ability to distinguish highly similar group activities, while experiments based on the V2-openset method evaluate its ability to handle completely unseen samples.
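As an illustration of how such a split could be materialized in code, the sketch below filters clips by label according to the V1/V2 definitions in Table 2; the label strings and the (clip, label) data layout are assumptions, not the authors' released split files.

```python
# Hypothetical helper for building the open set splits (label strings are assumed).
SPLITS = {
    "V1-openset": {"known": ["l-winpoint", "r-winpoint", "l-spike", "r-spike", "l-pass", "r-pass"],
                   "unknown": ["l-set", "r-set"]},
    "V2-openset": {"known": ["l-spike", "r-spike", "l-pass", "r-pass", "l-set", "r-set"],
                   "unknown": ["l-winpoint", "r-winpoint"]},
}

def build_open_set(train_clips, test_clips, split="V1-openset"):
    """train_clips/test_clips: lists of (clip_id, label) pairs from the closed set protocol."""
    known = set(SPLITS[split]["known"])
    # Training uses known classes only; testing keeps every clip and relabels unknown ones.
    train = [(c, y) for c, y in train_clips if y in known]
    test = [(c, y if y in known else "unknown") for c, y in test_clips]
    return train, test
```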
Evaluation protocols
For evaluating the performance in a closed set scenario, the conventional approach of calculating the top-1 accuracy (Acc) for group activity classification is adhered to. However, in the case of open set performance assessment, we adopt the established open set recognition protocol [74, 75]. This protocol utilizes the uncertainty score obtained from the model to calculate three key metrics: AUROC (Area Under the Receiver Operating Characteristic curve), AUPR (Area under the Precision-Recall curve), and FPR95 (False Positive Rate at 95% True Positive Rate). AUROC measures the classifier’s ability to distinguish between true positives (TPR) and false positives (FPR). An ideal classifier achieves an AUROC score of 1. AUPR quantifies precision and recall performance, illustrating the relationship between these two measures. FPR at 95% TPR represents the probability of incorrectly classifying a new example as a known one when the true positive rate is at 95%. While we do consider closed set accuracy as a reference, our main focus lies in the recognition of open set group activities.
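These metrics can be computed with standard tooling. The sketch below uses scikit-learn and treats the unknown class as the positive class when scoring samples by their uncertainty; this orientation and the function name are our assumptions about the protocol rather than the authors' evaluation script.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve

def open_set_metrics(uncertainty, is_unknown):
    """Sketch of the open set protocol using uncertainty as the score for 'unknown'.

    uncertainty: array of per-sample uncertainty values (higher = more likely unknown).
    is_unknown: binary array, 1 for samples whose true class is unknown.
    """
    auroc = roc_auc_score(is_unknown, uncertainty)
    aupr = average_precision_score(is_unknown, uncertainty)
    # FPR95: false positive rate at the threshold where 95% of unknown samples are detected
    fpr, tpr, _ = roc_curve(is_unknown, uncertainty)
    fpr95 = float(fpr[np.argmax(tpr >= 0.95)])
    return 100 * auroc, 100 * aupr, 100 * fpr95
```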
Implementation details
For a fair comparison, we resize the frames of the Volleyball dataset and the CAD dataset to fixed resolutions. We utilize T = 10 frames as the input to our model during both training and testing on both datasets. Human keypoints are estimated using HRNet [58] with 17 distinct keypoint types (J = 17). Our model uses three Transformer encoders, each with two attention heads and a dropout rate of 0.2. The MLP layer has a dimension of 1024 with ReLU activation, and the hidden dimension is set to 256. The hyperparameters $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ are set to 0.1, 0.5 and 1, respectively. The cluster number M is set to 2. The model is trained for a total of 100 epochs using the Adam optimizer [76] with an initial learning rate of 0.001; after 40 epochs, the learning rate is decreased to 0.0001. The weight decay is 0.001 and the batch size is 128. All models are trained on 2 NVIDIA RTX 4090 GPUs.
Evaluation results
Existing models for group activity recognition (GAR) are limited to closed set recognition. In order to facilitate open set recognition, additional components that distinguish between known and unknown classes must be incorporated into GAR models. Recent studies [4, 8, 43, 68, 79, 80] report consistently similar recognition capabilities on the two datasets. Since most of them do not provide open-source code, we choose Composer [8], which has publicly available code, as the state-of-the-art method for closed set group activity recognition. Additionally, several classical group activity recognition models are incorporated, namely ARG [3], DIN [29], and Groupformer [5]. Considering the usefulness of weakly supervised methods for practical applications, we also select DFWSGAR [6], a method with open-source code, from this category. In summary, the selected five GAR models span a spectrum of methods, encompassing graph neural networks, Transformer-based approaches, and weakly supervised methods, offering a comprehensive foundation for comparison. As illustrated in Table 3, we compare all GAR models by pairing them with four different open set recognition methods and evaluate their performance against our OSGAR method.
Table 3. Comparison with different methods
| GAR model | Open set method | Volleyball AUROC | Volleyball AUPR | Volleyball FPR95 | Volleyball Closed Set Acc | CAD AUROC | CAD AUPR | CAD FPR95 | CAD Closed Set Acc |
|---|---|---|---|---|---|---|---|---|---|
| ARG [3] | RPL [77] | 60.99 | 48.34 | 59.21 | 89.90 | 58.32 | 40.12 | 71.33 | 86.34 |
| ARG [3] | SoftMax [74] | 62.64 | 49.12 | 64.76 | 91.23 | 62.33 | 43.61 | 63.44 | 90.88 |
| ARG [3] | BNN SVI [78] | 64.63 | 50.10 | 58.39 | 90.24 | 65.31 | 52.33 | 58.97 | 87.23 |
| ARG [3] | DEAR [12] | 68.43 | 55.34 | 57.23 | 89.72 | 67.23 | 51.22 | 58.76 | 85.67 |
| DIN [29] | RPL [77] | 62.21 | 47.76 | 58.39 | 90.17 | 60.23 | 43.69 | 68.35 | 93.22 |
| DIN [29] | SoftMax [74] | 62.11 | 49.67 | 57.17 | 92.40 | 63.59 | 50.22 | 60.17 | 94.00 |
| DIN [29] | BNN SVI [78] | 66.86 | 52.33 | 58.66 | 89.72 | 64.73 | 50.23 | 59.28 | 90.11 |
| DIN [29] | DEAR [12] | 66.90 | 51.29 | 60.35 | 90.13 | 68.23 | 51.36 | 52.18 | 89.61 |
| Groupformer [5] | RPL [77] | 69.73 | 49.78 | 54.89 | 94.11 | 72.45 | 53.45 | 52.29 | 93.22 |
| Groupformer [5] | SoftMax [74] | 71.23 | 50.10 | 51.96 | 94.50 | 72.25 | 60.55 | 49.35 | 96.27 |
| Groupformer [5] | BNN SVI [78] | 73.88 | 62.32 | 48.66 | 93.21 | 80.19 | 61.27 | 47.12 | 94.28 |
| Groupformer [5] | DEAR [12] | 72.23 | 52.39 | 50.03 | 93.33 | 75.32 | 59.37 | 40.25 | 92.23 |
| Composer [8] | RPL [77] | 73.23 | 49.89 | 52.34 | 90.67 | 70.28 | 49.77 | 51.23 | 94.31 |
| Composer [8] | SoftMax [74] | 80.21 | 52.22 | 49.78 | 94.61 | 78.23 | 53.46 | 50.30 | 96.18 |
| Composer [8] | BNN SVI [78] | 83.34 | 54.32 | 44.63 | 93.21 | 79.33 | 56.78 | 48.45 | 95.98 |
| Composer [8] | DEAR [12] | 84.36 | 65.79 | 41.33 | 92.24 | 82.24 | 64.31 | 40.98 | 92.30 |
| DFWSGAR [6] | RPL [77] | 69.23 | 47.80 | 59.34 | 90.12 | / | / | / | / |
| DFWSGAR [6] | SoftMax [74] | 78.62 | 53.21 | 49.78 | 90.02 | / | / | / | / |
| DFWSGAR [6] | BNN SVI [78] | 82.31 | 50.32 | 44.96 | 89.35 | / | / | / | / |
| DFWSGAR [6] | DEAR [12] | 80.24 | 59.74 | 44.80 | 87.68 | / | / | / | / |
| OSGAR (ours) | – | 89.21 | 72.81 | 32.63 | 92.35 | 87.62 | 73.64 | 34.19 | 94.44 |

Note: All metrics are in %, and the best results are in bold
The experimental results presented in the table are obtained using the V1-openset split method on two datasets. As for the four open set methods, RPL [77] and SoftMax [74] rely on confidence for judgment, while BNN SVI [78] and DEAR [12] rely on uncertainty for judgment. Experimental results demonstrate that methods based on uncertainty often achieve better AUROC. This superiority can be attributed to the Dirichlet distribution’s alignment with classification in open set task. Notably, using the GAR model with Softmax generally maintains the highest closed set accuracy. Among the five group activity recognition models, ARG [3], DFWSGAR [6] and DIN [29] are based on RGB data for recognition, while Groupformer [5] and Composer [8] are based on skeleton information. The experimental results reveal that skeleton-based methods offer certain advantages over RGB-based models. This is because RGB data often contains scene biases influenced by the backgrounds of the videos. To address this limitation, we propose OSGAR, which combines the benefits of skeleton information and uncertainty-based methods. Furthermore, in comparison to Groupformer [5] and Composer [8], which also utilize attention mechanisms, our OSGAR method employs multi-head self-attention mechanisms to extract evidence features. This approach enables better capture of long-range semantic features across various interaction levels. By combining them with dot product attention, the method effectively concentrates on key information within multiple evidence, thus offering an initial solution for group activity recognition in an open set condition. OSGAR demonstrates superior performance across all three metrics while maintaining high accuracy in closed set recognition. The success of OSGAR can be attributed to the following reasons: a) OSGAR effectively captures intrinsic semantic information of group behavior by modeling at three different scales. b) By incorporating an attention mechanism, OSGAR aggregates multiple pieces of evidence in a complementary manner, enhancing the reliability of the final classification results.
Fig. 3 [Images not available. See PDF.]
OOD detection by confidence and uncertainty. On the V1-openset Volleyball split, we compare three open set methods built on Composer [8] with our OSGAR by plotting the density of the confidence or uncertainty scores they assign to samples
Fig. 4 [Images not available. See PDF.]
Visualization of prediction scores using different open set recognition methods on the split Volleyball datasets. The figure illustrates model predictions on the Volleyball dataset under the V1-openset and V2-openset split methods, shown in the upper and lower parts, respectively. The bar graphs represent the confidence values or uncertainties of the predictions. The SoftMax approach relies on confidence for decision-making, denoted by gray bars, while the BNN SVI and OSGAR methods make decisions based on uncertainty, represented by green bars. Misclassified predictions are marked with red crosses. Additionally, the subgroups identified during the EAC stage are highlighted with rectangular boxes of the same color in the images. Although the threshold values may vary between models, we merge the thresholds obtained from BNN SVI and OSGAR as they are quite similar. All thresholds are represented by dashed lines in the bar graphs
In addition, to better demonstrate the discriminative capabilities of different methods on open set samples, we conduct Out-of-Distribution (OOD) detection on two kinds of Volleyball datasets: one with known classes (in-distribution or IND) and the other with a mix of known and unknown classes (out-of-distribution or OOD). The two datasets were fed into separate models. Their performance is compared by visualizing the distributions of sample scores, encompassing confidence and uncertainty, as depicted in Fig. 3. In (a) and (b), the identification of unknown classes is based on a confidence threshold, where samples below the threshold are classified as unknown. In (c) and (d), classification is conducted using uncertainty, with samples above the threshold categorized as unknown. Besides, we observe that our method captures higher uncertainty in OOD samples compared to traditional uncertainty measures. This better represents the high uncertainty of unknown classes.
Next, we visually present the test results of three representative open set methods: Softmax based on confidence, BNN SVI based on uncertainty, and our OSGAR, on the Volleyball dataset under two open set splits, as shown in Fig. 4. For the confidence-based methods, samples with scores above the threshold are considered as known classes, while those below the threshold are considered as unknown classes. The opposite is true for uncertainty-based methods. The thresholds for both methods are determined by the point on the AUROC curve closest to (0,1). In the V1-openset split, the “set" class is considered as an unknown class. Due to its high similarity to the “pass" class, traditional methods mistakenly identify “set" as a known class. However, our OSGAR successfully identifies “set" as an unknown class by integrating multi-scale evidence. In the V2-openset split, the “winpoint" class is considered as an unknown class because it exhibits completely different behavior compared to the other three classes, aligning with the sudden and scattered characteristics of open set scenarios. In this test set, the Softmax method still fails to recognize “winpoint". Although BNNSVI shows higher uncertainty than other samples, it does not reach the threshold. However, our OSGAR successfully differentiates “winpoint" and exhibits significantly higher uncertainty in its predictions, demonstrating its excellent discriminative performance. Furthermore, the other two methods also exhibit some biases in predicting known classes, highlighting the limitations of traditional open set methods in adapting to group activity recognition tasks. Additionally, we also visualize the two subgroups after segmentation (M=2) on each sample, where rectangles of the same color represent individuals from the same subgroup. From the results, it can be observed that individuals in the yellow subgroup frequently engage in key interactions, while the blue subgroup primarily focuses on momentary actions.
Ablation study
Comparison with subgroup division methods
By comparing three different methods of dividing subgroups, we validate the effectiveness of our subgroup division in Table 4. The position-based method divides individuals directly according to spatial information, which produces coarse results. The feature-based method mainly aggregates features from multiple individuals, which introduces excessive redundancy. Both of these methods perform poorly in terms of open set metrics and closed set accuracy because they inherently weaken the accuracy of the evidence contributed by the subgroup scale. In contrast, our similarity-based method clusters subgroups by feature similarity and achieves a better balance.
Effectiveness of multiple evidence and their aggregation methods
As shown in Table 5, we vary the evidence input to the EAD stage to test the model's open set recognition capability. We observe that incorporating more evidence improves model performance by providing additional information about the individuals in the scene. In terms of AUROC, the best performance is achieved by using evidence from all three scales for joint prediction. This approach shows a 7.76% improvement compared to using evidence from only one scale and a more than 5% improvement compared to using evidence from two scales. When only one additional type of evidence is introduced, the inclusion of subgroup evidence shows an advantage over individual evidence. Overall, using multiple scales contributes to a hierarchical understanding of group activities, providing richer evidence for open set prediction. Furthermore, we compare different evidence aggregation methods. "MAX" indicates calculating the uncertainty for each scale separately and selecting the maximum value, while "Sum" represents directly summing the multiple evidence values to predict uncertainty. Our method employs attention-based aggregation. As shown in Table 6, in terms of AUROC, our method outperforms "Sum" by 3.89% and "MAX" by 6.91%. The reason for this improvement is that the "MAX" method treats the potential unknown characteristics in the samples in an extreme way, excessively amplifying the attention on unknown samples. The "Sum" method improves over "MAX" by beginning to balance the multiple pieces of evidence, but its effectiveness remains inferior to our attention-based method due to the lack of weighting on key evidence.
Table 4. Results on Volleyball in terms of different subgroup division methods
| Division method | AUROC | AUPR | FPR95 | Acc |
|---|---|---|---|---|
| Position-based | 88.16 | 69.10 | 34.27 | 90.03 |
| Feature-based | 88.26 | 69.77 | 36.76 | 90.11 |
| Similarity-based | 89.21 | 72.81 | 32.63 | 92.35 |

Note: All metrics are in %, and the best results are in bold
Table 5. Results on Volleyball in terms of different evidence
| Individual | Subgroup | Group | AUROC | AUPR | FPR95 | Acc |
|---|---|---|---|---|---|---|
|  |  | ✓ | 81.45 | 63.88 | 47.54 | 89.76 |
| ✓ |  | ✓ | 83.55 | 67.23 | 40.29 | 91.56 |
|  | ✓ | ✓ | 83.76 | 66.48 | 43.56 | 91.28 |
| ✓ | ✓ | ✓ | 89.21 | 72.81 | 32.63 | 92.35 |

Note: All metrics are in %, and the best results are in bold
Contribution of our loss
As shown in Table 7, we conduct experiments on the Volleyball dataset by removing the $\mathcal{L}_{EAC}$ and $\mathcal{L}_{EAD}$ loss functions from OSGAR. The experimental results indicate that removing the constraint on the attention mechanism in the EAD stage ($\mathcal{L}_{EAD}$) severely affects the reliability of the final uncertainty. Additionally, incorporating the $\mathcal{L}_{EAC}$ loss on top of the $\mathcal{L}_{EAD}$ loss enhances, to some extent, the effectiveness of the evidence collected in the EAC stage.
Conclusion
In this paper, we introduce the concept of open set group activity recognition and propose a framework, OSGAR, as a solution for recognizing group activities in real-world scenarios. OSGAR models group activities at different scales, enhancing the hierarchical understanding of group behaviors and emphasizing the advantage of subgroups. The proposed two-stage method, consisting of Evidence Aware Collection and Evidence Aware Decision, aggregates evidence from different scales to provide reliable uncertainty predictions, effectively addressing the challenges of open set scenarios. Moreover, we perform open set splits on existing datasets and demonstrate that our method achieves good performance in distinguishing known and unknown classes while maintaining high closed set recognition ability. Overall, our OSGAR framework makes valuable contributions to group activity recognition in open set environments.
Table 6. Comparison of evidence aggregation methods
| Aggregation method | AUROC | AUPR | FPR95 |
|---|---|---|---|
| Max | 82.30 | 67.82 | 44.95 |
| Sum | 85.32 | 68.22 | 39.76 |
| Ours | 89.21 | 72.81 | 32.63 |

Note: All metrics are in %, and the best results are in bold
Table 7. Test on the effects of different loss
| Method | AUROC | AUPR | FPR95 |
|---|---|---|---|
| OSGAR (w/o $\mathcal{L}_{EAC}$) | 87.64 | 70.21 | 38.11 |
| OSGAR (w/o $\mathcal{L}_{EAD}$) | 65.53 | 58.27 | 49.57 |
| OSGAR (full) | 89.21 | 72.81 | 32.63 |

Note: All metrics are in %, and the best results are in bold
Limitations
This study provides an initial exploration of the open set problem in group activity recognition. Future research could focus on capturing helpful contextual information from activity scenes to assist in decision-making. Additionally, there is a need for diverse and publicly available datasets specifically designed for open set group activity recognition. Researchers are encouraged to contribute in this area by creating such datasets to facilitate further advancements.
Acknowledgements
This work was supported by National Key R &D Program of China (2022YFB4501600).
Author Contributions
LZ: Conceptualization, resources. SW: Methodology, writing—original draft preparation, visualization, writing—reviewing and editing. XC: Visualization, investigation. YY: Supervision. XL: Validation.
Data availability statement
The Volleyball dataset can be obtained from https://github.com/mostafa-saad/deep-activity-rec, while the CAD dataset is available at https://vhosts.eecs.umich.edu/vision//activity-dataset.html. Our code will be made available on request.
Declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Choi, W., Shahid, K., Savarese, S.: What are they doing?: Collective activity classification using spatio-temporal relationship among people. In: 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pp. 1282–1289. IEEE (2009)
2. Wu, L; Tian, M; Xiang, Y; Gu, K; Shi, G. Learning label semantics for weakly supervised group activity recognition. IEEE Trans. Multimedia; 2024; 26, pp. 6386-6397. [DOI: https://dx.doi.org/10.1109/TMM.2024.3349923]
3. Wu, J., Wang, L., Wang, L., Guo, J., Wu, G.: Learning actor relation graphs for group activity recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9964–9974 (2019)
4. Wang, L; Feng, W; Tian, C; Chen, L; Pei, J. 3d-unified spatial-temporal graph for group activity recognition. Neurocomputing; 2023; 556, 126646. [DOI: https://dx.doi.org/10.1016/j.neucom.2023.126646]
5. Li, S., Cao, Q., Liu, L., Yang, K., Liu, S., Hou, J., Yi, S.: Groupformer: Group activity recognition with clustered spatial-temporal transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13668–13677 (2021)
6. Kim, D., Lee, J., Cho, M., Kwak, S.: Detector-free weakly supervised group activity recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20083–20093 (2022)
7. Gavrilyuk, K., Sanford, R., Javan, M., Snoek, C.G.: Actor-transformers for group activity recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 839–848 (2020)
8. Zhou, H., Kadav, A., Shamsian, A., Geng, S., Lai, F., Zhao, L., Liu, T., Kapadia, M., Graf, H.P.: Composer: compositional reasoning of group activity in videos with keypoint-only modality. In: European Conference on Computer Vision, pp. 249–266 (2022). Springer
9. Du, Z; Wang, X; Wang, Q. Perceiving local relative motion and global correlations for weakly supervised group activity recognition. Image Vis. Comput.; 2023; 137, 104789. [DOI: https://dx.doi.org/10.1016/j.imavis.2023.104789]
10. Shu, Y., Shi, Y., Wang, Y., Zou, Y., Yuan, Q., Tian, Y.: Odn: Opening the deep network for open-set action recognition. In: 2018 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6 (2018). IEEE
11. Yoon, Y; Yu, J; Jeon, M. Spatio-temporal representation matching-based open-set action recognition by joint learning of motion and appearance. IEEE Access; 2019; 7, pp. 165997-166010. [DOI: https://dx.doi.org/10.1109/ACCESS.2019.2953455]
12. Bao, W., Yu, Q., Kong, Y.: Evidential deep learning for open set action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13349–13358 (2021)
13. Zhao, C., Du, D., Hoogs, A., Funk, C.: Open set action recognition via multi-label evidential learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22982–22991 (2023)
14. Choi, W; Savarese, S. Understanding collective activities of people from videos. IEEE Trans. Pattern Anal. Mach. Intell.; 2013; 36,
15. Shu, T., Xie, D., Rothrock, B., Todorovic, S., Chun Zhu, S.: Joint inference of groups, events and human roles in aerial videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4576–4584 (2015)
16. Lin, W; Chu, H; Wu, J; Sheng, B; Chen, Z. A heat-map-based algorithm for recognizing group activities in videos. IEEE Trans. Circuits Syst. Video Technol.; 2013; 23,
17. Lin, W; Sun, M-T; Poovendran, R; Zhang, Z. Group event detection with a varying number of group members for video surveillance. IEEE Trans. Circuits Syst. Video Technol.; 2010; 20,
18. Amer, M.R., Xie, D., Zhao, M., Todorovic, S., Zhu, S.-C.: Cost-sensitive top-down/bottom-up inference for multiscale activity recognition. In: Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part IV 12, pp. 187–200 (2012). Springer
19. Amer, M.R., Lei, P., Todorovic, S.: Hirf: Hierarchical random field for collective activity recognition in videos. In: European Conference on Computer Vision, pp. 572–585 (2014). Springer
20. Bagautdinov, T., Alahi, A., Fleuret, F., Fua, P., Savarese, S.: Social scene understanding: End-to-end multi-person action localization and collective activity recognition. IEEE Conference on Computer Vision & Pattern Recognition (2016)
21. Ibrahim, M.S., Muralidharan, S., Deng, Z., Vahdat, A., Mori, G.: A hierarchical deep temporal model for group activity recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1971–1980 (2016)
22. Shu, T., Todorovic, S., Zhu, S.-C.: Cern: confidence-energy recurrent network for group activity recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5523–5531 (2017)
23. Wang, M., Ni, B., Yang, X.: Recurrent modeling of interaction context for collective activity recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3048–3056 (2017)
24. Tang, Y., Wang, Z., Li, P., Lu, J., Yang, M., Zhou, J.: Mining semantics-preserving attention for group activity recognition. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 1283–1291 (2018)
25. Qi, M., Jie, Q., Li, A., Wang, Y., Luo, J., Gool, L.V.: stagnet: An attentive semantic rnn for group activity recognition. In: Springer, Cham (2018)
26. Tang, J; Shu, X; Yan, R; Zhang, L. Coherence constrained graph LSTM for group activity recognition. IEEE Trans. Pattern Anal. Mach. Intell.; 2019; 44,
27. Shu, X; Tang, J; Qi, G-J; Liu, W; Yang, J. Hierarchical long short-term concurrent memory for human interaction recognition. IEEE Trans. Pattern Anal. Mach. Intell.; 2019; 43,
28. Ehsanpour, M., Abedin, A., Saleh, F., Shi, J., Reid, I., Rezatofighi, H.: Joint learning of social groups, individuals action and sub-group activities in videos. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16, pp. 177–195 (2020). Springer
29. Yuan, H., Ni, D., Wang, M.: Spatio-temporal dynamic inference network for group activity recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7476–7485 (2021)
30. Lin, W; Chen, Y; Wu, J; Wang, H; Sheng, B; Li, H. A new network-based algorithm for human activity recognition in videos. IEEE Trans. Circuits Syst. Video Technol.; 2013; 24,
31. Yan, R; Xie, L; Tang, J; Shu, X; Tian, Q. Higcin: Hierarchical graph-based cross inference network for group activity recognition. IEEE Trans. Pattern Anal. Mach. Intell.; 2020; 45,
32. Pramono, R.R.A., Chen, Y.T., Fang, W.H.: Empowering relational network by self-attention augmented conditional random fields for group activity recognition. In: European Conference on Computer Vision, pp. 71–90 (2020). Springer
33. Yan, R., Xie, L., Tang, J., Shu, X., Tian, Q.: Social adaptive module for weakly-supervised group activity recognition. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16, pp. 208–224 (2020). Springer
34. Hu, G., Cui, B., He, Y., Yu, S.: Progressive relation learning for group activity recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 980–989 (2020)
35. Tang, Y., Wei, Y., Yu, X., Lu, J., Zhou, J.: Graph interaction networks for relation transfer in human activity videos. IEEE Trans. Circuits Syst. Video Technol. 30 (2020)
36. Shu, X., Zhang, L., Sun, Y., Tang, J.: Host-parasite: Graph LSTM-in-LSTM for group activity recognition. IEEE Trans. Neural Netw. Learn. Syst. 32 (2020)
37. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
38. Tarashima, S.: One-shot deep model for end-to-end multi-person activity recognition. In: British Machine Vision Conference (2021)
39. Yuan, H., Ni, D.: Learning visual context for group activity recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 3261–3269 (2021)
40. Li, W., Yang, T., Wu, X., Du, X.-J., Qiao, J.-J.: Learning action-guided spatio-temporal transformer for group activity recognition. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 2051–2060 (2022)
41. Hu, B., Cham, T.-J.: Entry-flipped transformer for inference and prediction of participant behavior. In: European Conference on Computer Vision, pp. 439–456 (2022). Springer
42. Han, M., Zhang, D.J., Wang, Y., Yan, R., Yao, L., Chang, X., Qiao, Y.: Dual-ai: Dual-path actor interaction learning for group activity recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2990–2999 (2022)
43. Zhu, X., Zhou, Y., Wang, D., Ouyang, W., Su, R.: Mlst-former: Multi-level spatial-temporal transformer for group activity recognition. IEEE Trans. Circuits Syst. Video Technol. (2022)
44. Du, Z., Wang, X., Wang, Q.: Self-supervised global spatio-temporal interaction pre-training for group activity recognition. IEEE Trans. Circuits Syst. Video Technol. 33, 5076–5088 (2023). https://dx.doi.org/10.1109/TCSVT.2023.3249906
45. Li, F., Wechsler, H.: Open set face recognition using transduction. IEEE Trans. Pattern Anal. Mach. Intell. 27 (2005)
46. Scheirer, W.J., Rezende Rocha, A., Sapkota, A., Boult, T.E.: Toward open set recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35 (2012)
47. Jain, L.P., Scheirer, W.J., Boult, T.E.: Multi-class open set recognition using probability of inclusion. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III 13, pp. 393–409 (2014). Springer
48. Bendale, A., Boult, T.E.: Towards open set deep networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1563–1572 (2016)
49. Neal, L., Olson, M., Fern, X., Wong, W.-K., Li, F.: Open set learning with counterfactual images. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 613–628 (2018)
50. Ditria, L., Meyer, B.J., Drummond, T.: Opengan: Open set generative adversarial networks. In: Proceedings of the Asian Conference on Computer Vision (2020)
51. Yang, G., Zhou, S., Wan, M.: Open-set recognition model based on negative-class sample feature enhancement learning algorithm. Mathematics 10 (2022)
52. Yoshihashi, R., Shao, W., Kawakami, R., You, S., Iida, M., Naemura, T.: Classification-reconstruction learning for open-set recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4016–4025 (2019)
53. Oh, H., Kim, S.B.: Multivariate time series open-set recognition using multi-feature extraction and reconstruction. IEEE Access 10, 120063–120073 (2022). https://dx.doi.org/10.1109/ACCESS.2022.3222310
54. Huang, H., Wang, Y., Hu, Q., Cheng, M.-M.: Class-specific semantic reconstruction for open set recognition. IEEE Trans. Pattern Anal. Mach. Intell. 45 (2022)
55. Roitberg, A., Al-Halah, Z., Stiefelhagen, R.: Informed democracy: voting-based novelty detection for action recognition. arXiv preprint arXiv:1810.12819 (2018)
56. Kendall, A., Gal, Y.: What uncertainties do we need in bayesian deep learning for computer vision? Adv. Neural Inf. Process. Syst. 30 (2017)
57. Sensoy, M., Kaplan, L., Kandemir, M.: Evidential deep learning to quantify classification uncertainty. Adv. Neural Inf. Process. Syst. 31 (2018)
58. Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X.: Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 43 (2020)
59. Choi, J., Gao, C., Messou, J.C., Huang, J.-B.: Why can’t I dance in the mall? Learning to mitigate scene bias in action recognition. Adv. Neural Inf. Process. Syst. 32 (2019)
60. Kim, Y.-W., Mishra, S., Jin, S., Panda, R., Kuehne, H., Karlinsky, L., Saligrama, V., Saenko, K., Oliva, A., Feris, R.: How transferable are video representations based on synthetic data? Adv. Neural Inf. Process. Syst. 35, 35710–35723 (2022)
61. Duan, H., Zhao, Y., Chen, K., Lin, D., Dai, B.: Revisiting skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2969–2978 (2022)
62. Noor, N., Park, I.K.: A lightweight skeleton-based 3d-cnn for real-time fall detection and action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2179–2188 (2023)
63. Zhai, X., Hu, Z., Yang, D., Zhou, L., Liu, J.: Spatial temporal network for image and skeleton based group activity recognition. In: Proceedings of the Asian Conference on Computer Vision, pp. 20–38 (2022)
64. Zhang, J., Jia, Y., Xie, W., Tu, Z.: Zoom transformer for skeleton-based group activity recognition. IEEE Trans. Circuits Syst. Video Technol. 32(12), 8646–8659 (2022)
65. Yue, R., Tian, Z., Du, S.: Action recognition based on RGB and skeleton data sets: A survey. Neurocomputing 512, 287–306 (2022). https://dx.doi.org/10.1016/j.neucom.2022.09.071
66. Yan, R., Tang, J., Shu, X., Li, Z., Tian, Q.: Participation-contributed temporal dynamic model for group activity recognition. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 1292–1300 (2018)
67. Li, D., Xie, Y., Zhang, W., Tang, Y., Zhang, Z.: Attentive pooling for group activity recognition. arXiv preprint arXiv:2208.14847 (2022)
68. Mao, K., Jin, P., Ping, Y., Tang, B.: Modeling multi-scale sub-group context for group activity recognition. Appl. Intell. 53 (2023)
69. Sinaga, K.P., Yang, M.-S.: Unsupervised k-means clustering algorithm. IEEE Access 8, 80716–80727 (2020). https://dx.doi.org/10.1109/ACCESS.2020.2988796
70. Sentz, K., Ferson, S.: Combination of evidence in Dempster–Shafer theory (2002)
71. Jøsang, A.: Subjective logic (2016)
72. Yang, K., Gao, J., Feng, Y., Xu, C.: Leveraging attribute knowledge for open-set action recognition. In: 2023 IEEE International Conference on Multimedia and Expo (ICME), pp. 762–767 (2023). IEEE
73. Ibrahim, M., Muralidharan, S., Deng, Z., Vahdat, A., Mori, G.: A hierarchical deep temporal model for group activity recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
74. Hendrycks, D., Gimpel, K.: A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136 (2016)
75. Hendrycks, D., Mazeika, M., Dietterich, T.: Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606 (2018)
76. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
77. Chen, G., Qiao, L., Shi, Y., Peng, P., Li, J., Huang, T., Pu, S., Tian, Y.: Learning open set network with discriminative reciprocal points. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pp. 507–522 (2020). Springer
78. Krishnan, R., Subedar, M., Tickoo, O.: Bar: Bayesian activity recognition using variational inference. arXiv preprint arXiv:1811.03305 (2018)
79. Wang, C., Mohamed, A.S.A.: Attention relational network for skeleton-based group activity recognition. IEEE Access 11, 129230–129239 (2023). https://dx.doi.org/10.1109/ACCESS.2023.3332651
80. Li, Y., Liu, Y., Yu, R., Zong, H., Xie, W.: Dual attention based spatial-temporal inference network for volleyball group activity recognition. Multimedia Tools Appl. 82 (2023)