Abstract
Unsupervised person re-identification (Re-ID) is a critical and challenging task in computer vision. It aims to identify the same person across different camera views or locations without using any labeled data or annotations. Most existing unsupervised Re-ID methods adopt a clustering and fine-tuning strategy, which alternates between generating pseudo-labels through clustering and updating the model parameters through fine-tuning. However, this strategy has two major drawbacks: (1) the pseudo-labels obtained by clustering are often noisy and unreliable, which may degrade the model performance; and (2) the model may overfit to the pseudo-labels and lose its generalization ability during fine-tuning. To address these issues, we propose a novel method that integrates silhouette coefficient-based label correction and contrastive loss regularization based on loose–tight cluster guidance. Specifically, we use silhouette coefficients to measure the quality of pseudo-labels and correct the potentially noisy labels, thereby reducing their negative impact on model training. Moreover, we introduce a new contrastive loss regularization term that consists of two components: a cluster-level contrastive loss that encourages the model to learn discriminative features, and a regularization loss that prevents the model from overfitting to the pseudo-labels. The weights of these components are dynamically adjusted according to the silhouette coefficients. Furthermore, we adopt the Vision Transformer as the backbone network to extract more robust features. We conduct extensive experiments on several public datasets and demonstrate that our method achieves significant improvements over state-of-the-art unsupervised Re-ID methods.
Introduction
Unsupervised person re-identification (Re-ID) is an important and challenging problem in computer vision and surveillance [1, 2], and has been widely used in many computer vision or computer graphics fields [3, 4, 5–6]. It involves identifying the same person across different camera views or locations without using any labeled data or annotations. Traditional supervised Re-ID methods rely on labeled datasets, which are costly and time-consuming to collect. In contrast, unsupervised Re-ID methods aim to develop models that can perform person re-identification using only unlabeled data. However, without any ground truth identification labels, this task becomes even more difficult. In recent years, due to the high demand for intelligent video surveillance in practical applications and the increasing difficulty of annotating surveillance data, unsupervised person re-identification has attracted more attention in the computer vision community. Therefore, designing efficient and robust unsupervised person re-identification systems is of great significance, both academically and industrially.
Unsupervised person re-identification (Re-ID) can be classified into two main categories: unsupervised domain adaptation (UDA) and fully unsupervised learning (FU). UDA methods aim to adapt a model trained on a labeled source domain to an unlabeled target domain [7, 8, 9–10]. This reduces the domain gap between the source and target domains, enabling the model to generalize better in the target domain. In contrast, FU methods seek to achieve person re-identification without any labeled information, relying only on unlabeled data from the target domain [11, 12]. This requires the model to learn valuable features and information from unsupervised data. FU methods usually pre-train on ImageNet first, then generate pseudo-labels through clustering, and finally use these pseudo-labels for model training. This approach is often called clustering and fine-tuning [13]. Fine-tuning usually involves training the network using a contrastive loss, such as triplet loss [14, 15], InfoNCE loss [16], or other non-parametric classification losses [17]. Generally, UDA methods outperform FU methods for person re-identification, as UDA methods can leverage the learned features from the labeled source domain to improve the performance in the target domain. Our paper focuses primarily on the FU setting.
Fig. 1 [Images not available. See PDF.]
Illustration of the silhouette coefficient and the loose–tight cluster. a The schematic diagram of silhouette coefficients, where different colors represent different clusters. $a_i$ is the average distance between sample $x_i$ and all other samples in the same cluster. $b_i$ is the minimum distance between sample $x_i$ and the nearest cluster. $s_i$ is the value of the silhouette coefficient. b The schematic diagram of the loose–tight cluster. The original cluster is represented by the cluster on the left with a gray background, which contains numerous samples. After reducing the neighborhood radius $\epsilon$, the original cluster splits into multiple compact clusters (indicated by different colors in the figure), including outliers. These compact clusters exhibit better purity than the original cluster
Existing state-of-the-art FU methods incorporate pseudo-labeling-based algorithms into a hybrid memory-based contrast learning framework [18, 19]. These methods typically start by using a feature extractor to extract image features from all training data and store them in a memory bank. Then, clustering techniques, such as DBSCAN [20] or K-means [21], are applied to group these image features and generate pseudo-labels, assigning each image a pseudo-label that represents its identity. Finally, the deep model is trained using the corresponding image features in the memory bank, with the help of contrastive loss or other identity classification losses. This memory-based approach has the following advantages: (1) The memory bank enables the model to be continuously improved, as it can accumulate and store useful information during the training process, and then update the parameters based on this knowledge to enhance the model’s performance. (2) Memory-based methods are usually more scalable and suitable for processing large-scale datasets. They can efficiently handle millions or billions of samples, which is important for practical large-scale surveillance and smart video applications. Therefore, this memory-based approach can overcome the limitation of the GPU’s memory during the training process and optimize the deep model globally in each epoch.
Although memory-based contrastive learning methods have achieved remarkable performance, there is still a large gap between FU methods and supervised learning methods. After analyzing these methods, we identify two key challenges for improving FU methods: (1) how to detect samples with incorrect pseudo-label assignments; (2) how to mitigate their impact during the optimization process. Regarding the first challenge, several previous works have proposed useful solutions. For example, UNRN [22] proposed an uncertainty assessment strategy, which estimated pseudo-label confidence by calculating the inconsistency of soft multi-label predictions between two models. SPCL [18] proposed a clustering reliability criterion for selecting reliable clusters. However, UNRN requires maintaining two models and lacks simplicity, and SPCL requires a threshold for the reliability criterion, which does not effectively guide the correction of erroneous pseudo-labels. Regarding the second challenge, most existing methods focus on generating more robust soft supervision, while neglecting to actively reduce the noise rate.
To address the first challenge, we introduce a noise correction module, which is used for confidence estimation and noisy label correction. We use the silhouette coefficient [23] to evaluate the quality of clustering results. Its value reflects the similarity of a sample to other samples within the same cluster (cohesion) and the dissimilarity to samples in other clusters (separation). The silhouette coefficient ranges from −1 to 1, where a value closer to 1 indicates a higher similarity between a data point and its own cluster, and a lower similarity to neighboring interfering clusters. These characteristics of the silhouette coefficient provide us with a theoretical basis for estimating and utilizing the confidence of individual samples.
To address the second challenge, we introduce the concept of loose–tight clusters and incorporate confidence into loose–tight cluster regularization. This approach aims to mitigate the adverse effects of noisy labels on the training process. Loose–tight cluster is a concept used to describe the distribution characteristics of samples within clusters. In this context, loose clusters refer to samples within a cluster being relatively dispersed or not compact in the feature space, while tight clusters indicate that samples within a cluster are more densely concentrated or similar in the feature space. Moreover, we introduce a correction module based on silhouette coefficients, aimed at actively reducing the noise rate of pseudo-labels. Our main contributions are summarized as follows:
We propose to exploit the neighborhood radius $\epsilon$ of the DBSCAN algorithm to control the tightness of the clusters, thereby improving the purity of the clusters.
We introduce the loose–tight cluster regularization algorithm to address the noise during the clustering process, thereby enhancing the robustness of the model.
Under the unsupervised person re-identification setting, our proposed method outperforms most state-of-the-art algorithms across multiple datasets. Particularly, on the MSMT17 dataset, our approach achieves a significant improvement in mAP, surpassing the current state-of-the-art methods by 1.1%.
Related works
Unsupervised person re-identification
Unsupervised person Re-ID can be classified into two main categories, namely unsupervised domain adaptation (UDA) and fully unsupervised learning (FU). However, neither UDA nor FU can avoid the effect of noise in the pseudo-labeling on the performance of the model. In recent years, research efforts have mainly focused on obtaining auxiliary information for fine-tuning labels, aiming to alleviate the issue of label noise. SSL [24] utilizes a label-softened classification network by introducing auxiliary information. PPLR [25] aggregates the predictions of partial features using the complementary relationship between global and local features and jointly mitigates the noise in the clustering. CASTOR [26] utilizes a pre-trained camera classification network to adaptively compensate for the adverse effects of inter-camera distribution bias on clustering. It also employs an asymptotic outlier recovery strategy to redistribute some of the outliers into clustering groups for unsupervised person re-identification.
Vision transformer in unsupervised person re-identification
Person re-identification aims to search for specific individuals across different scenes and cameras. Robust feature extraction from person images has always been a critical component of Re-ID algorithms, and it has long been dominated by methods based on convolutional neural networks (CNNs) [27, 28, 29, 30, 31–32]. Reviewing the CNN-based methods, it is evident that two challenges in person re-identification algorithms have not been effectively addressed. (1) Exploring rich structural information on a global scale is crucial for person re-identification. However, CNN-based methods have primarily focused on small discriminative regions. While attention modules have been introduced to explore long-range dependencies and mitigate this limitation, they are often embedded in deep layers and cannot address the fundamental issue of CNNs in extracting multiple diverse discriminative regions; (2) fine-grained features with detailed information are also crucial for person re-identification. However, the downsampling operators in CNNs reduce the spatial resolution of the output feature maps, significantly impacting the discriminative capability for distinguishing similar persons [33, 34].
Recently, Vision Transformers (ViTs) [35] have shown that the original Transformer architecture can be as effective as CNN-based methods for image feature extraction. In 2021, Luo et al. [36] proposed TransReID, a person re-identification method entirely based on Transformers. This approach outperformed state-of-the-art CNN-based algorithms, demonstrating a significant performance improvement. However, TransReID is a supervised method and uses a model pre-trained on ImageNet. Since the image contents of ImageNet and Re-ID datasets are very different, and ImageNet's supervised pre-training focuses on category-level supervision, it lacks rich visual information. Therefore, Transformer-based person re-identification algorithms require pre-training on larger-scale and more task-specific datasets to avoid overfitting to the pre-training dataset. A large-scale pre-training dataset called LUPerson [37], with a large collection of unlabeled person images, has been constructed by researchers.
Memory bank updating strategy
The memory bank updating strategy is a well-studied research direction in deep learning, aiming to improve the training efficiency and performance of deep neural networks. Wu et al. proposed a method that uses a memory bank to store features of historical data, improving feature learning in self-supervised settings. This method introduces a memory mechanism into model training, which helps to improve the performance of contrastive losses and other self-supervised objectives. The state-of-the-art unsupervised person re-identification methods have also incorporated a memory bank for contrastive learning, introducing various strategies to continuously update it. In particular, for the FU person re-identification task, Ge et al. [18] introduced a self-paced contrastive learning approach tailored to FU person Re-ID tasks. Zheng et al. [22] also proposed leveraging the uncertainty of samples to address the FU person re-identification task, building upon memory-based contrastive learning. These methods utilize the memory bank to measure the similarity between samples and the instances stored in it, thereby aiding cross-batch information retrieval from the sample data.
Silhouette coefficient
Silhouette coefficient is a widely used metric for evaluating the quality of clustering results, which is crucial for various fields such as data mining and pattern recognition. Rousseeuw [23] introduced the silhouette coefficient in 1987 and explained its utility in determining the consistency of clusters. It quantifies the similarity of an object to its own cluster (cohesion) compared to other clusters (separation), with values ranging from −1 to 1. Silhouette coefficient is often used in combination with other techniques to improve the quality of unsupervised learning results. For example, it is commonly used in hierarchical clustering [38] to determine the optimal number of clusters. Moreover, it has been used in feature selection and dimensionality reduction methods to enhance the effectiveness of clustering algorithms. Silhouette coefficient has the following advantages for unsupervised learning tasks: (1) It does not require any prior knowledge. (2) It considers the similarity of samples within clusters and between different clusters, which enables a more comprehensive assessment of the cluster quality. With these advantages, we use the silhouette coefficient to estimate the confidence level of pseudo-labeling.
Discussion
Despite the excellent performance of the unsupervised methods mentioned above, they increase the model complexity. We propose to use the silhouette coefficient to effectively improve the pseudo-labeling of individual instances without additional auxiliary features. Furthermore, we employ the Transformer as the backbone network for unsupervised person re-identification, following the configuration established by Luo et al. [36]. We conduct pre-training using the carefully curated LUPerson dataset. However, the Transformer still faces challenges related to clustering noise in the context of unsupervised person re-identification. NLSC [39] incorporates confidence into the InfoNCE loss to mitigate the negative impact of noisy labels on the training process, achieving excellent performance. However, the presence of noisy samples causes the centroids of different clusters to drift too close to each other. Building upon NLSC, we introduce the loose–tight cluster regularization (LTCR) algorithm. It adjusts the neighborhood radius of the DBSCAN algorithm to achieve varying degrees of cluster tightness. Tighter clusters are less likely to contain noisy samples; therefore, we propose to reduce the impact of noisy pseudo-labels on model performance by decreasing the distance between low-silhouette-coefficient samples and the centroids of their corresponding tight clusters.
To obtain a more accurate centroid, we introduce momentum updates into the cluster center feature updates, following [40]. This achieves dynamic and continuous updates of cluster center features. During the initialization of cluster features, we use silhouette coefficients to weight the cluster features. This weighted approach, unlike the conventional direct averaging, results in more accurate cluster centers, which benefits the model’s training.
Fig. 2 [Images not available. See PDF.]
The overall architecture of the proposed LTCR. The LTCR consists of two main components: i Correction module: In order to deal with noisy pseudo-labels, we compute a correction matrix through a unified clustering noise suppression framework, which is used to correct the distances of hard samples in the clustering distance matrix. ii Loose–tight cluster regularization: The original clustering results obtained in the correction module are potentially unreliable, and the impact of noisy samples on model training should be reduced. We utilize the silhouette coefficient as a moderator between the contrastive loss at the original cluster level and the contrastive loss in the loose–tight cluster bootstrap
Methodology
We propose a contrastive loss regularization based on loose–tight cluster guidance to effectively reduce clustering noise. The overall framework is illustrated in Fig. 2a. It consists of two main modules: (1) the correction module for noise suppression, guided by silhouette coefficients; (2) the loose–tight cluster regularization, which integrates the contrastive loss and the regularization loss. Specifically, we use the Vision Transformer [35] as the backbone network for extracting semantic features of input images. Similar to [18], we use the DBSCAN [20] algorithm to group the extracted instance features into several clusters and assign pseudo-labels to all clusters. Based on the idea of re-ranking, we introduce the Jaccard distance based on k-reciprocal nearest neighbors [41] as the distance metric of DBSCAN to improve the clustering effectiveness. We use silhouette coefficients to assess the outcomes of clustering. The loss function used in model training comprises two distinct components: first, the cluster-level contrastive loss, guided by silhouette coefficients; and second, the regularization loss, guided by the concept of loose–tight clusters. The influence of these two loss types is balanced by the silhouette coefficients, directing the model's updates. For reliable samples with high silhouette coefficients, the model mainly learns from the cluster-level contrastive loss. Conversely, for less reliable samples with low silhouette coefficients, the model mainly learns from the regularization loss.
The training process scheme alternates between the following three steps: (1) Instance features are clustered into clusters, and pseudo-labels are assigned using the DBSCAN algorithm. (2) The noise suppression framework corrects the clustering results and dynamically updates the memory bank. (3) The model is updated by selecting a loss function based on silhouette coefficients.
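A minimal sketch of this alternation is given below; `extract_features`, `build_jaccard_matrix`, `apply_correction`, `silhouette`, `init_memory`, and `train_one_epoch` are hypothetical helpers standing in for the components detailed in the following subsections, and the hyperparameter values follow the implementation details reported in the experiments:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def train(model, dataset, epochs=50, eps=0.7, eps_tight=0.55, min_pts=4):
    for epoch in range(epochs):
        # Step 1: cluster instance features and assign pseudo-labels.
        feats = extract_features(model, dataset)       # (N, D), L2-normalized
        dist = build_jaccard_matrix(feats)             # k-reciprocal Jaccard distances
        dist = apply_correction(dist)                  # Step 2: noise suppression
        labels = DBSCAN(eps=eps, min_samples=min_pts,
                        metric="precomputed").fit_predict(dist)
        # A second, tighter clustering yields the compact clusters.
        tight = DBSCAN(eps=eps_tight, min_samples=min_pts,
                       metric="precomputed").fit_predict(dist)
        # Step 2 (cont.): refresh the memory bank with silhouette-weighted centroids.
        s = silhouette(dist, labels)                   # per-sample coefficients
        memory = init_memory(feats, labels, np.maximum(s, 0.0))
        # Step 3: fine-tune with the silhouette-balanced loss.
        train_one_epoch(model, dataset, labels, tight, memory, s)
```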
Preview
Silhouette coefficient [23] is a criterion for the quality of clustering, which reflects the similarity of each sample with its respective cluster and with other clusters. The values of the coefficients range from −1 to 1, with larger coefficients indicating better clustering results. For a sample $x_i \in C_m$ (the $m$th cluster), the silhouette coefficient $s_i$ is defined as:
$$s_i = \frac{b_i - a_i}{\max(a_i,\, b_i)} \qquad (1)$$
$$a_i = \frac{1}{|C_m| - 1} \sum_{x_j \in C_m,\, j \neq i} d_J(x_i, x_j) \qquad (2)$$
$$b_i = \min_{n \neq m} \frac{1}{|C_n|} \sum_{x_j \in C_n} d_J(x_i, x_j) \qquad (3)$$
where $|C_m|$ denotes the number of samples in the cluster and $d_J(x_i, x_j)$ is the Jaccard distance [41] between samples $x_i$ and $x_j$. $a_i$ denotes the average distance between sample $x_i$ and all other samples within $C_m$, and $b_i$ is the minimum average distance of sample $x_i$ to any other cluster. Equations (2) and (3) provide the formulas for $a_i$ and $b_i$, respectively. Specifically, when $s_i$ is less than 0, the sample is likely to be assigned to a wrong cluster. Therefore, we modify the silhouette coefficient as follows: $\tilde{s}_i = \max(s_i, 0)$.
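For reference, a direct NumPy implementation of Eqs. (1)-(3) over a precomputed distance matrix might look as follows; the handling of DBSCAN outliers (label −1) is our assumption, and the final clipping implements the modified coefficient $\tilde{s}_i$:

```python
import numpy as np

def silhouette_coefficients(D, labels):
    """Per-sample silhouette values from a precomputed (Jaccard) distance
    matrix D, following Eqs. (1)-(3). Samples labeled -1 (DBSCAN outliers)
    keep a coefficient of 0."""
    n = len(labels)
    s = np.zeros(n)
    cluster_ids = [c for c in np.unique(labels) if c != -1]
    members = {c: np.where(labels == c)[0] for c in cluster_ids}
    for i in range(n):
        c = labels[i]
        others = [o for o in cluster_ids if o != c]
        if c == -1 or len(members[c]) < 2 or not others:
            continue
        same = members[c][members[c] != i]
        a = D[i, same].mean()                              # Eq. (2): intra-cluster mean
        b = min(D[i, members[o]].mean() for o in others)   # Eq. (3): nearest other cluster
        s[i] = (b - a) / max(a, b)                         # Eq. (1)
    return np.maximum(s, 0.0)                              # modified coefficient
```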
Correction module based on silhouette coefficient
As shown in the correction module of Fig. 2b, due to external factors, people with different identities might be grouped into the same cluster, while people with the same identity might be distributed across different clusters. In the first case, the samples usually have larger $a_i$ and $b_i$, and thus the weight of the split-matrix $M^{split}$ should be increased. To effectively split the samples that are misclassified into the same cluster, we use spectral clustering [42] to further divide each cluster into three subclasses. For a sample pair $(x_i, x_j)$, the split-matrix entry $M^{split}_{ij}$ is computed as:
4 [Equation not available. See PDF.]
5 [Equation not available. See PDF.]
where $d_{max}$ is the maximum distance between samples of the same cluster in DBSCAN, and $\lambda$ is the scaling factor. Specifically, $C_m^y$ indicates the $y$th subclass of the $m$th cluster $C_m$, with $y \in \{1, \dots, n\}$. $n$ is usually a small number; for convenience, we set it to 2. The value of $M^{split}_{ij}$ is determined by $\lambda$, which is used to control the distance between subclasses in the same cluster. In the second case, the sample usually has smaller $a_i$ and $b_i$, and thus the merge-matrix $M^{merge}$ dominates. Similarly, we use the subclass structure to compute $M^{merge}_{ij}$ as:
6 [Equation not available. See PDF.]
7 [Equation not available. See PDF.]
If more than 2/3 of the samples within a cluster share the same nearest cluster, we denote this nearest cluster as the merge target. We use $M^{merge}$ to pull the cluster and its merge target closer together. Finally, we use $M^{split}$ and $M^{merge}$ to compute the correction matrix $M^{corr}$, and the final matrix for clustering is obtained by combining $M^{corr}$ with $D_J$, the pre-computed Jaccard distance matrix with k-reciprocal nearest neighbors. For convenience, we initialize $M^{corr}$ to 0 in the first epoch. Notably, spectral clustering allows most pedestrian images with different IDs in the same cluster to be distinguished. However, when the number of identities in a cluster is less than three, samples with the same identity may also be split. This may seem to harm the model performance, but these mistakenly separated samples can eventually be merged back into the same cluster with the help of $M^{split}$ and $M^{merge}$.

We adopt the k-reciprocal nearest neighbors-based Jaccard distance matrix to calculate the silhouette coefficients, and the value of k has a significant effect on the clustering results. In previous studies, k was typically set to 20 or 30. However, many clusters larger than 60 are still present in the DukeMTMC-Re-ID and MSMT17 datasets. With these large clusters, a smaller k would underestimate their silhouette coefficients.
According to [41], we can infer that for any two samples within the same cluster, if they are not each other's k-reciprocal nearest neighbors, their Jaccard distance is 1, which is the maximum value. Therefore, the size of k is crucial. When a cluster's size is much larger than k, most of the sample distances within the same cluster will be 1, leading to a larger $a_i$ than expected, and consequently a lower $s_i$. Similarly, choosing a large k will result in more samples with different IDs being considered as k-reciprocal nearest neighbors, making their silhouette coefficients inaccurate.
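To illustrate this property, here is a deliberately simplified sketch of the k-reciprocal Jaccard distance; the full method of [41] additionally expands the neighbor sets and uses soft, Gaussian-weighted assignments:

```python
import numpy as np

def kreciprocal_jaccard(D, k=30):
    """Simplified k-reciprocal Jaccard distance in the spirit of [41].
    Two samples are close only if their k-reciprocal neighbor sets overlap;
    non-overlapping sets yield the maximum distance of 1, which is why the
    choice of k matters for clusters much larger than k."""
    n = D.shape[0]
    knn = np.argsort(D, axis=1)[:, :k + 1]       # k nearest neighbors (plus self)
    recip = [set(j for j in knn[i] if i in knn[j]) for i in range(n)]
    J = np.ones((n, n))
    for i in range(n):
        for j in range(n):
            union = len(recip[i] | recip[j])
            if union:
                J[i, j] = 1.0 - len(recip[i] & recip[j]) / union
    return J
```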
To minimize the effect of different cluster sizes on the silhouette coefficients, we combine the Jaccard distance matrices computed with different values of k. Specifically, we choose three values of k, namely 30, 60, and 90, to compute the original matrices $M^{(k)}$. Then, we combine these matrices as follows:
8 [Equation not available. See PDF.]
We use $M_i$ to denote both the $i$th column and the $i$th row of $M$, due to the symmetry of $M$. $|C_i|$ is the size of the cluster containing the sample $x_i$.
Fig. 3 [Images not available. See PDF.]
The performance of different k on the MSMT17 and Market1501 datasets
Figure 3 demonstrates the effect of different values of k on the model performance. It is worth noting that the silhouette coefficients computed by combining Jaccard distance matrices with different values of k significantly enhance the model performance.
Loose–tight cluster regularization
The neighborhood radius $\epsilon$ is a critical parameter in the density-based clustering algorithm DBSCAN. It refers to the maximum distance between samples within the same cluster. To mitigate the impact of noisy samples on model training when the initial clustering results are unreliable, we introduce a hyperparameter $\epsilon' < \epsilon$ to obtain more stringent clusters than the original clustering results. These clusters are not only purer but also serve as a supplementary refinement of the existing outcomes. When samples have a low silhouette coefficient, it indicates a higher likelihood that they are in the wrong clusters. In this situation, it is appropriate to increase the contribution of the tight clusters to the model's training, while reducing the impact of the original clustering results on the training process. Based on these considerations, we use the silhouette coefficient as a moderator between the original cluster-level contrastive loss and the loose–tight cluster-guided contrastive loss. A lower silhouette coefficient corresponds to a higher regularization strength, mitigating the issue of clustering centroids of different categories being pulled close together by noisy samples. Although the diversity of same-category samples within compact clusters is reduced, the sample distribution within compact clusters is richer and more authentic than in conventional regularization methods that use instance features augmented through data augmentation as class prototypes. This approach leads to the training of a more robust network. We use $\tilde{s}_i^{\beta}$ as the credibility weight of sample $x_i$. For a feature vector $f_i$, our formula for the loose–tight cluster-guided contrastive loss in a mini-batch of training samples is as follows:
$$\mathcal{L}_{i} = -\,\tilde{s}_i^{\beta} \log \frac{\exp(\langle f_i, c_{+} \rangle / \tau)}{\sum_{k=1}^{K} \exp(\langle f_i, c_k \rangle / \tau)} - \left(1 - \tilde{s}_i^{\beta}\right) \log \frac{\exp(\langle f_i, c'_{+} \rangle / \tau)}{\sum_{j=1}^{K'} \exp(\langle f_i, c'_j \rangle / \tau)} \qquad (9)$$
where $c_{+}$ is the prototype of the cluster to which the feature vector $f_i$ belongs, and $c'_{+}$ is the prototype of the compact cluster to which $f_i$ belongs. $K'$ and $K$ correspond to the number of categories within the compact clusters and the original clusters, respectively. $\beta$ serves as an amplification factor to magnify the differences in sample contribution caused by $\tilde{s}_i$. $\tau$ is the temperature coefficient used to control the output probability distribution, with an empirical setting of 0.05. $\langle \cdot, \cdot \rangle$ represents the inner product of two feature vectors. Notably, Eq. (9) holds if and only if the feature vector belongs to both an original cluster and a compact cluster; otherwise, it degenerates to a cluster-level contrastive loss.

When the neighborhood radius $\epsilon$ is reduced, the previous clusters may split into multiple smaller clusters, and outliers will appear (gray points in Fig. 1b). Recalculating the cluster centroids for each small cluster and applying the aforementioned contrastive loss brings low-silhouette-coefficient samples closer to the centroids of their respective small clusters. This ensures that the network is updated in the correct direction; otherwise, similar noisy samples may undermine the fine-grained classification capability in unsupervised person re-identification. This differs from the silhouette coefficient-guided contrastive loss proposed by NLSC [39], which only reduces the weight of noisy samples in model training. The proposed loose–tight cluster regularization aims to pull the model closer to the centroids of pure small clusters when the original clustering results are suboptimal. This strategy is designed to suppress the misclassification of similar samples into the same cluster and ensure their dispersion in the feature space during the training of unsupervised person re-identification models.
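In code, this loss might be sketched as follows, under the reconstruction of Eq. (9) given above; the exact placement of the credibility weight and the fallback for samples outside any tight cluster are our assumptions:

```python
import torch
import torch.nn.functional as F

def ltcr_loss(f, memory, memory_tight, y, y_tight, s, beta=2.0, tau=0.05):
    """Sketch of the loose-tight cluster-guided contrastive loss.

    f:            (B, D) L2-normalized mini-batch features
    memory:       (K, D) original-cluster prototypes
    memory_tight: (K', D) tight-cluster prototypes
    y, y_tight:   (B,) pseudo-labels in the two clusterings; y is assumed
                  valid for every batch sample, y_tight may be -1 (outlier)
    s:            (B,) modified silhouette coefficients in [0, 1]
    """
    w = s.clamp(0, 1) ** beta                        # credibility weight
    loss_c = F.cross_entropy(f @ memory.t() / tau, y, reduction="none")
    in_tight = y_tight >= 0
    loss_t = torch.zeros_like(loss_c)
    if in_tight.any():                               # tight-cluster term
        loss_t[in_tight] = F.cross_entropy(
            f[in_tight] @ memory_tight.t() / tau, y_tight[in_tight],
            reduction="none")
    blend = w * loss_c + (1 - w) * loss_t
    # Samples outside any tight cluster degenerate to the cluster-level loss.
    return torch.where(in_tight, blend, loss_c).mean()
```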
Fig. 4 [Images not available. See PDF.]
The initialization process of the memory bank
Memory bank update
We store all N cluster centers in the memory bank, where N represents the number of clusters in the dataset. Initially, we extract instance features using the ViT model pre-trained on LUPerson. Then, we apply the DBSCAN [20] method to cluster these instances into several clusters for initialization. The clustering results can significantly affect the learned representations; ideally, all samples should be correctly assigned to their true clusters. However, in practice, noise is inevitable. If a sample is erroneously assigned to a neighboring cluster, it often has a lower silhouette coefficient. To reduce the influence of noisy samples on cluster centers, we use silhouette coefficients as weights for feature-weighted averaging during the initialization of the memory bank.
Since the features of each sample of the training dataset change as the training process proceeds, timely updating of the features in the cluster memory dictionary is crucial for improving model performance. Similar to ClusterContrast [49], we rerun the clustering algorithm before each epoch to refresh the cluster centroid features. At the same time, we use a momentum-based updating approach to achieve dynamic and continuous updates of the cluster centroid features. We store the cluster centroid features in a memory bank to facilitate the computation of the contrastive loss. This approach allows us to use the latest feature vectors of the cluster members to update the cluster centroid features in the memory bank. The feature update formula is as follows:
$$c_i \leftarrow m\, c_i + (1 - m)\, \bar{q}_i \qquad (10)$$
where $m$ is the momentum updating factor, empirically set as 0.1, and $\bar{q}_i$ denotes the average of the $i$th class instance features in the mini-batch. Besides using momentum-based updates for improved stability, we also use silhouette coefficients to initialize the cluster features before each epoch, aiming to obtain more accurate cluster centroids. The conventional method of direct averaging often introduces more noise, while using silhouette coefficient-weighted cluster features can effectively increase their distance from noisy samples, thereby benefiting the model training process. The accuracy of the clustering results gradually improves as the model is updated, and eventually better feature representations can be learned. As shown in Fig. 4, for a sample $x_i$, we define its weight as $w_i = \tilde{s}_i$, and the cluster centroid is computed as:
$$c = \frac{\sum_{x_i \in C} w_i f_i}{\sum_{x_i \in C} w_i} \qquad (11)$$
where $f_i$ represents the feature vector of sample $x_i$. If each $w_i$ is nearly equal, the outcome of this method will be equivalent to the arithmetic mean.
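The following PyTorch sketch mirrors Eqs. (10) and (11); the weight choice follows the definition above, while the re-normalization of centroids after each update is our assumption:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_memory(memory, feats, labels, m=0.1):
    """Momentum update of Eq. (10): each centroid moves toward the mean
    feature of its pseudo-class within the mini-batch, with m weighting
    the old centroid."""
    for c in labels.unique():
        q = feats[labels == c].mean(dim=0)           # mini-batch class mean
        memory[c] = F.normalize(m * memory[c] + (1 - m) * q, dim=0)

@torch.no_grad()
def weighted_centroid(feats, s):
    """Silhouette-weighted centroid of Eq. (11), with w_i the clipped
    silhouette value; equal weights reduce this to the arithmetic mean."""
    w = s.clamp(min=0)
    if w.sum() == 0:                                 # all-noisy fallback
        return F.normalize(feats.mean(dim=0), dim=0)
    c = (w[:, None] * feats).sum(dim=0) / w.sum()
    return F.normalize(c, dim=0)
```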
Table 1. Comparison with state-of-the-art methods
| Methods | Reference | Backbone | Market1501 | | | | MSMT17 | | | | DukeMTMC | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | mAP | R1 | R5 | R10 | mAP | R1 | R5 | R10 | mAP | R1 | R5 | R10 |
Unsupervised domain adaptation | ||||||||||||||
MMT [34] | ICLR’20 | Res50 | 73.8 | 89.5 | 96.0 | 97.6 | 24.0 | 50.1 | 63.5 | 69.3 | 62.3 | 76.3 | 87.7 | 91.2 |
MEB [43] | ECCV’20 | Res50 | 76.0 | 89.9 | 96.0 | 97.5 | – | – | – | – | 66.1 | 79.6 | 88.3 | 92.2 |
SpCL [18] | NeurIPS’20 | Res50 | 76.7 | 90.3 | 96.2 | 97.7 | 26.8 | 53.7 | 65.0 | 69.8 | 68.8 | 82.9 | 90.1 | 92.5 |
GLT [44] | CVPR’21 | Res50 | 79.5 | 92.2 | 96.5 | 97.8 | 27.7 | 59.5 | 70.1 | 74.2 | 69.2 | 82.0 | 90.2 | 92.8 |
AdaDC [45] | TCSVT’22 | Res50 | 82.8 | 92.6 | 97.7 | – | 32.7 | 60.7 | 73.6 | – | 71.4 | 82.3 | 91.6 | – |
AdaMG [46] | TCSVT’23 | Res50 | 84.6 | 93.9 | 97.9 | 98.9 | 38.0 | 66.3 | 76.9 | 80.6 | – | – | – | – |
Fully unsupervised (Res50) | ||||||||||||||
SSL [24] | CVPR’20 | Res50 | 37.8 | 71.7 | 83.8 | 87.4 | – | – | – | – | 28.6 | 52.5 | 63.5 | 68.9 |
HCT [47] | CVPR’20 | Res50 | 56.4 | 80.0 | 91.6 | 95.2 | – | – | – | – | 50.7 | 69.6 | 83.4 | 87.4 |
CycAs [48] | ECCV’20 | Res50 | 64.8 | 84.8 | – | – | 26.7 | 50.1 | – | – | 60.1 | 77.9 | – | – |
SpCL [18] | NeurIPS’20 | Res50 | 73.1 | 88.1 | 95.1 | 97.0 | 19.1 | 42.3 | 55.6 | 61.2 | 65.3 | 81.2 | 90.3 | 92.2 |
ClusterContrast [49] | ACCV’22 | Res50 | 82.1 | 92.3 | 96.7 | 97.9 | 27.6 | 56.0 | 66.8 | 71.5 | 72.6 | 84.9 | 91.9 | 93.9 |
ISE [50] | CVPR’22 | Res50 | 85.3 | 94.3 | 98.0 | 98.8 | 37.0 | 67.6 | 77.5 | 81.0 | – | – | – | – |
NLSC [39] | ICME’22 | Res50 | 85.7 | 94.2 | 98.0 | 98.7 | 36.5 | 67.8 | 77.7 | 81.2 | 75.4 | 87.0 | 93.9 | 95.4 |
PPLR [25] | CVPR’22 | Res50 | 81.5 | 92.8 | 97.1 | 98.1 | 31.4 | 61.1 | 73.4 | 77.8 | – | – | – | – |
STDA [51] | T-ITS’23 | Res50 | 82.7 | 93.1 | 97.3 | 98.4 | 31.8 | 62.6 | 73.4 | 77.5 | – | – | – | – |
LTP [52] | TIP’23 | Res50 | 85.8 | 94.5 | 97.8 | 98.7 | 39.5 | 67.9 | 78.0 | 81.6 | – | – | – | – |
DCCT [53] | TCSVT’23 | Res50 | 86.3 | 94.4 | 97.7 | 98.5 | 41.8 | 68.7 | 79.0 | 82.6 | – | – | – | – |
LTCR(Res50) | This paper | Res50 | 85.9 | 93.8 | 97.2 | 98.3 | 40.8 | 68.9 | 79.3 | 82.9 | 74.1 | 86.0 | 92.5 | 94.3 |
Fully Unsupervised (ViT) | ||||||||||||||
PASS [54] | ECCV’22 | ViT | 88.5 | 94.9 | – | – | 41.0 | 67.0 | – | – | – | – | – | – |
TransReID-SSL [55] | arxiv’21 | ViT | 88.2 | 94.2 | – | – | 40.9 | 66.4 | – | – | – | – | – | – |
SSPW [56] | ICSP’22 | ViT | 89.8 | 95.1 | – | – | 42.4 | 67.1 | – | – | – | – | – | – |
TransCL [57] | IJCNN’22 | ViT | 82.9 | 93.0 | 97.3 | 98.3 | 41.3 | 68.6 | 79.3 | 83.0 | – | – | – | – |
LTCR(ViT) | This paper | ViT | 88.7 | 94.9 | 98.1 | 98.7 | 43.5 | 68.7 | 77.7 | 83.1 | 75.6 | 87.0 | 94.1 | 95.3 |
Bold represents the best result
Experiments
Datasets and evaluation metrics
We conducted a series of experiments on three widely recognized public datasets for person re-identification: DukeMTMC-Re-ID (Duke) [58], Market1501 (Market) [59], and MSMT17 (MSMT) [60], to validate the effectiveness of the proposed method. We evaluate the performance of the algorithm on the retrieval task using mean average precision (mAP) and Rank-1 accuracy from the cumulative matching characteristic (CMC) curve. Additionally, we use the Fowlkes–Mallows index (FMI) to assess the degree of match between the clustering results and the ground-truth labels.
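Since computing FMI requires ground-truth identities, it is used offline for analysis only. With scikit-learn it is a one-liner; `gt_ids` and `pseudo_labels` below are hypothetical label arrays:

```python
from sklearn.metrics import fowlkes_mallows_score

# Geometric mean of pairwise precision and recall over same-cluster pairs;
# 1.0 means the clustering reproduces the ground-truth identities exactly.
fmi = fowlkes_mallows_score(gt_ids, pseudo_labels)
```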
Implementation details
We use the Transformer pre-trained on LUPerson [37] as the backbone network. For the input training images, we apply data augmentation techniques such as random cropping, erasing, and flipping. LUPerson contains 4.18 million unlabeled images of the human body from 50,534 online videos. To evaluate the performance on the Re-ID task, we conduct experiments on two popular benchmarks, namely Market1501 and MSMT17. They contain 32,668 images of 1501 identities and 126,441 images of 4101 identities, respectively. During both training and inference, the images from these two datasets are resized to 256×128 pixels. Each mini-batch consists of 256 person images with 32 pseudo-IDs (each containing eight instances). All experiments use the SGD optimizer to train the Re-ID model, with a weight decay of 5e-4. The initial learning rate is set to 3.5e-4 and is decayed by a factor of 10 every 20 epochs, for a total of 50 epochs of training. Similar to [18], we perform clustering using DBSCAN and the Jaccard distance with k-reciprocal nearest neighbors before each epoch. The value of k for the Jaccard distance is set to 30. For DBSCAN, the maximum distance between adjacent samples within the same cluster is set to 0.7. The minimum number of samples at core points, MinPts, is set to 4. The weighting factor is set to 4, and the scaling factor $\lambda$ is set to 200. In addition, the neighborhood radius $\epsilon'$ for tight clusters is set to 0.55.
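For reference, a minimal sketch of the optimizer, schedule, and clustering configuration stated above; `model` and the precomputed `jaccard_dist` matrix are assumed to exist:

```python
import torch
from sklearn.cluster import DBSCAN

# SGD with lr=3.5e-4 and weight decay 5e-4, decayed by 10x every 20 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=3.5e-4, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

# DBSCAN over the k-reciprocal Jaccard distances (k=30): eps=0.7 for the
# original clusters, eps'=0.55 for the tight clusters, MinPts=4 for both.
loose = DBSCAN(eps=0.70, min_samples=4, metric="precomputed").fit_predict(jaccard_dist)
tight = DBSCAN(eps=0.55, min_samples=4, metric="precomputed").fit_predict(jaccard_dist)
```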
Comparison with state-of-the-arts
We compare LTCR with several state-of-the-art unsupervised person re-identification methods, and the results are shown in Table 1. Our LTCR significantly outperforms the best FU method by 1.1% in mAP accuracy on the MSMT17 dataset. Furthermore, compared with TransReID-SSL, our mAP accuracy is also improved by 0.5% and 2.6% on the Market1501 and MSMT17 datasets, respectively.
We also compare the proposed method with the state-of-the-art UDA Re-ID method in Table 1. The UDA method uses the labeled source domain dataset to assist the training on the unlabeled target datasets, which usually achieves better performance than the FU Re-ID methods. However, our method significantly surpasses the state-of-the-art unsupervised domain adaptation (UDA) methods without extra training data. Moreover, our method enhances the quality of pseudo-labels by suppressing false proximity of similar samples and can be easily integrated into existing UDA methods.
As shown in Table 1, to evaluate our model more thoroughly, we conducted further experiments with ResNet-50 (Res50) pre-trained on ImageNet as the backbone network. Although this variant does not perform as well as the Transformer-based framework, it still achieves excellent performance.
Table 2. Ablation studies of our proposed LTCR using the Market1501 and MSMT17 datasets
| Regularization | Correction module | Accurate cluster centroids | Market1501 | | | | MSMT17 | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | | mAP | top-1 | top-5 | top-10 | mAP | top-1 | top-5 | top-10 |
| | | | 89.4 | 95.0 | 98.2 | 99.1 | 36.6 | 61.9 | 71.3 | 77.5 |
| | | | 88.4 | 94.7 | 98.0 | 98.8 | 41.1 | 66.5 | 77.3 | 81.1 |
| | | | 88.7 | 95.0 | 98.2 | 99.0 | 43.0 | 68.2 | 78.4 | 82.2 |
| | | | 88.2 | 94.5 | 97.9 | 98.6 | 39.6 | 64.5 | 74.9 | 79.2 |
| | | | 88.3 | 94.8 | 98.2 | 98.9 | 30.1 | 54.1 | 65.1 | 69.9 |
| | | | 87.7 | 94.4 | 97.7 | 98.6 | 30.2 | 53.5 | 64.5 | 69.6 |
| | | | 89.6 | 95.5 | 98.5 | 99.0 | 43.5 | 68.7 | 77.7 | 83.1 |
Ablation study
The proposed person Re-ID method consists of two main novel components: (1) loose–tight cluster regularization; (2) the correction module, a unified framework for confidence estimation and pseudo-label noise correction. To investigate the contribution of each component to the model performance, we use the Vision Transformer as the backbone network and conduct ablation experiments on the Market1501 and MSMT17 datasets. All ablation studies were conducted with the respective optimal parameters. Results are reported in Table 2.
Comparison studies with baseline. We follow the approach of [61] and use a subset of the LUPerson dataset, selected based on the catastrophic forgetting score, to pre-train the Transformer and establish a new baseline model. We use the corrected silhouette coefficients to identify and correct the noise in the initial clustering results. The new baseline model is compared with the "Baseline" in Table 3. After incorporating our proposed loose–tight cluster regularization, a consistent improvement in both mAP performance and FMI scores is achieved across different datasets. This indicates that using calibrated silhouette coefficients to estimate pseudo-label confidence, and adjusting the training weights between the original clusters and the tight clusters accordingly, can suppress the loss of fine-grained classification ability, which in turn enhances the model performance and the discriminative power of the person features. It is well known that the stronger the baseline, the more difficult it is to improve the model performance. Even on top of the Transformer backbone, which already brings a large improvement, the regularization loss based on loose–tight cluster guidance still performs well, which fully demonstrates the effectiveness of this strategy. Moreover, in addition to boosting the model's mAP, the regularization strategy also adds 5 points to the FMI metric (0.56 to 0.61 on MSMT17). This enhancement illustrates that the regularization loss alleviates the degradation of the model's fine-grained feature extraction ability. Fine-grained features are crucial for person re-identification models: the higher the feature differentiation between different categories, the more accurate the clustering result, which is reflected in the improvement of the FMI.
Effectiveness of the Regularization. When the feature does not belong to both an original cluster and a tight cluster, the regularization degenerates into the cluster-level contrastive loss. To verify the effectiveness of the regularization, we replace it with the plain cluster-level contrastive loss in the ablation studies. As shown in Table 2, the regularization loss significantly improves model performance. In particular, on the more challenging MSMT17 dataset, with its complex scene variations and longer time spans, the performance gain from the regularization is even more noticeable, yielding a remarkable improvement of 6.9 percentage points in mAP.
Effectiveness of the correction module. The correction module corrects incorrect labels by modifying the Jaccard distance matrix. To verify the effectiveness of this strategy, we evaluate the performance of the framework without the correction matrix; the resulting degradation in mAP is shown in Table 2. This strategy has a greater impact on the MSMT17 dataset, which is more challenging and more likely to contain mislabeling.
Effectiveness of the Accurate Cluster Centroids. We use silhouette coefficients to weight the cluster features, which increases the distance of the cluster centroids from the noisy samples. To evaluate the effectiveness of this strategy, we use direct averaging of the cluster features as a comparison. As shown in Table 2, performance improves after using the accurate cluster centroids. Without them, noisy labels contaminate the cluster centroids and damage the performance of the model. Because the traditional method of obtaining cluster centroids by directly averaging the cluster features treats each sample in a cluster equally, it may bias the cluster centroids toward the noisy samples, misdirecting the network optimization.
Table 3. Comparative experiments between our method and baseline
Methods | Market1501 | MSMT17 | ||||
|---|---|---|---|---|---|---|
mAP | R1 | FMI | mAP | R1 | FMI | |
Baseline [61] | 88.2 | 94.2 | 0.86 | 40.9 | 66.4 | 0.56 |
Ours | 89.6 | 95.5 | 0.87 | 43.5 | 68.7 | 0.61 |
Bold represents the best result
Fig. 5 [Images not available. See PDF.]
Parameter analysis on MSMT17
Parameter analysis
We further analyze the sensitivity of our method to five hyperparameters on the MSMT17 dataset. These hyperparameters include the neighborhood radius of the compact cluster $\epsilon'$, the original cluster neighborhood radius $\epsilon$, the scaling factor $\lambda$ for noisy label correction, the amplification factor $\beta$ for loose–tight cluster regularization, and the number of clusters N for spectral clustering. In our experiments, we adjust one parameter's value at a time while keeping the others constant.
The hyperparameter $\epsilon'$ is the maximum distance between samples within the same tight cluster. Figure 5a examines the influence of the compact cluster neighborhood radius $\epsilon'$ on the results. The selection of $\epsilon'$ needs to balance the purity and diversity of the compact clusters. A small $\epsilon'$ ensures pure cluster composition without noise, but it can also lead to more cluster outliers and lower sample diversity, which is unfavorable for improving model performance. A large $\epsilon'$ enhances sample diversity but also increases the probability of including noise. Hence, choosing an appropriate $\epsilon'$ is crucial. Through experimentation, we found that an $\epsilon'$ slightly smaller than the original $\epsilon$ by 0.15 yields the best performance. In the case of MSMT17, the optimal $\epsilon'$ is 0.55, which reaches the performance peak.
The $\epsilon$ parameter. The parameter $\epsilon$ is the maximum distance between samples within the same original cluster, as shown in Fig. 5b. We analyze the effect of $\epsilon$ and observe that when $\epsilon$ is greater than a certain value, the improvement of the model is not significant. The main reason is that once $\epsilon$ reaches a certain level, the original clusters tend to stabilize and no longer affect the model performance.
The $\lambda$ parameter. Figure 5c illustrates the effect of the scaling factor $\lambda$; a smaller $\lambda$ cannot suppress small split/merge distances and will affect the final performance of the model. Figure 5e illustrates the curves of the split/merge distance for different $\lambda$. From Fig. 5e, we can see that the curves are similar on certain intervals, in which case the performance of the algorithm is not sensitive to the value of $\lambda$. Based on Fig. 5c, we set $\lambda$ = 200, which achieves excellent performance on all datasets.
Sensitivity to cluster number. Figure 5e investigates the sensitivity of the algorithm to the number of clusters N. To separate samples with different identities in the same cluster, we use spectral clustering to divide each cluster into N parts, and then calculate the split distance between these N parts based on the $d_{max}$ of the clusters. Moreover, for pure clusters that do not contain multiple pedestrian IDs, the computed split distance is small, so these clusters will not be split into multiple smaller clusters in the next iteration. As can be seen from Fig. 5e, although the value of N affects the final performance of the algorithmic framework, our proposed framework is not sensitive to this hyperparameter within a certain range.
The $\beta$ parameter. The parameter $\beta$ is an amplification factor used to further amplify the difference in loss weights between low-confidence and high-confidence samples. To study the sensitivity of the algorithm to $\beta$, we vary $\beta$ from 1 to 6. As shown in Fig. 5d, the algorithm reaches its best accuracy when $\beta$ is set to 2.
Conclusion
In this paper, we propose a novel regularization that adjusts the neighborhood radius $\epsilon$ of DBSCAN to obtain clusters with varying degrees of tightness and looseness. We assume that compact clusters often contain less noise, and we introduce a regularization term on top of the existing cluster-level contrastive loss guided by silhouette coefficients. The regularization term increases the weight of the compact cluster in the contrastive loss for samples with lower silhouette coefficients, which effectively mitigates the negative impact of noisy pseudo-labels in the original clustering results. With the regularization term, the features extracted by the model become more robust, and the FMI metric, which measures the accuracy of the pseudo-labels, is further improved. We also demonstrate the effectiveness of our proposed algorithm through extensive performance comparisons and ablation experiments.
Author Contributions
YL and PS wrote the main manuscript text, LZ prepared figures 1, 3 and table 1, YF prepared figure 4, SJ prepared table 3, QZ prepared table 2, and CY prepared figure 2. All authors reviewed the manuscript.
Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Miao, Z; Zhang, Y; Piao, X; Chu, Y; Yin, B. Region feature smoothness assumption for weakly semi-supervised crowd counting. Comput. Animat. Virtual Worlds; 2023; 34, pp. 3-4. [DOI: https://dx.doi.org/10.1002/cav.2173]
2. Shi, J; Xiu, Y; Tang, G. Research on occlusion block face recognition based on feature point location. Comput. Animat. Virtual Worlds; 2022; 33, pp. 3-4. [DOI: https://dx.doi.org/10.1002/cav.2094]
3. Sun, L; Tang, T; Qu, Y; Qin, W. Bidirectional temporal feature for 3d human pose and shape estimation from a video. Comput. Animat. Virtual Worlds; 2023; 34, pp. 3-4. [DOI: https://dx.doi.org/10.1002/cav.2187]
4. Xu, Q; Liu, F; Fu, Z; Zhou, A; Qi, J. Aes-gcn: attention-enhanced semantic-guided graph convolutional networks for skeleton-based action recognition. Comput. Animat. Virtual Worlds; 2022; 33, pp. 3-4. [DOI: https://dx.doi.org/10.1002/cav.2070]
5. Jiang, N; Sheng, B; Li, P; Lee, T. Photohelper: portrait photographing guidance via deep feature retrieval and fusion. IEEE Trans. Multim.; 2023; 25, pp. 2226-2238. [DOI: https://dx.doi.org/10.1109/TMM.2022.3144890]
6. Sheng, B; Li, P; Ali, R; Chen, CLP. Improving video temporal consistency via broad learning system. IEEE Trans. Cybern.; 2022; 52,
7. Ma, A.J., Yuen, P.C., Li, J.: Domain transfer support vector ranking for person re-identification without target camera label information. In: 2013 IEEE International Conference on Computer Vision, pp. 3567–3574 (2013)
8. Deng, W., Zheng, L., Kang, G., Yang, Y., Ye, Q., Jiao, J.: Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 994–1003 (2017)
9. Yu, H.-X., Zheng, W., Wu, A., Guo, X., Gong, S., Lai, J.: Unsupervised person re-identification by soft multilabel learning. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2143–2152 (2019)
10. Zhong, Z., Zheng, L., Luo, Z., Li, S., Yang, Y.: Invariance matters: exemplar memory for domain adaptive person re-identification. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 598–607 (2019)
11. Lin, Y., Dong, X., Zheng, L., Yan, Y., Yang, Y.: A bottom-up clustering approach to unsupervised person re-identification. In: AAAI Conference on Artificial Intelligence (2019)
12. Zhang, X., Ge, Y., Qiao, Y., Li, H.: Refining pseudo labels with clustering consensus over generations for unsupervised object re-identification. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3435–3444 (2021)
13. Fan, H., Zheng, L., Yang, Y.: Unsupervised person re-identification: clustering and fine-tuning. arXiv:1705.10444 (2017)
14. Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: a unified embedding for face recognition and clustering. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823 (2015)
15. Hermans, A., Beyer, L., Leibe, B.: In defense of the triplet loss for person re-identification. arXiv:1703.07737 (2017)
16. van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv:1807.03748 (2018)
17. Wang, D., Zhang, S.: Unsupervised person re-identification via multi-label classification. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10978–10987 (2020)
18. Ge, Y., Chen, D., Zhu, F., Zhao, R., Li, H.: Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. arXiv:2006.02713 (2020)
19. Si, T; He, F; Li, P. Hybrid feature constraint with clustering for unsupervised person re-identification. Vis. Comput.; 2023; 39,
20. Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Knowledge Discovery and Data Mining (1996)
21. Kanungo, T; Mount, DM; Netanyahu, NS; Piatko, CD; Silverman, R; Wu, AY. An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell.; 2002; 24, pp. 881-892. [DOI: https://dx.doi.org/10.1109/TPAMI.2002.1017616]
22. Zheng, K., Lan, C., Zeng, W., Zhang, Z., Zha, Z.: Exploiting sample uncertainty for domain adaptive person re-identification. In: AAAI Conference on Artificial Intelligence (2020)
23. Rousseeuw, PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math.; 1987; 20, pp. 53-65. [DOI: https://dx.doi.org/10.1016/0377-0427(87)90125-7]
24. Lin, Y., Xie, L., Wu, Y., Yan, C.C., Tian, Q.: Unsupervised person re-identification via softened similarity learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3387–3396 (2020)
25. Cho, Y.H., Kim, W.J., Hong, S., Eui Yoon, S.: Part-based pseudo label refinement for unsupervised person re-identification. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7298–7308 (2022)
26. Xu, M; Guo, H; Jia, Y; Dai, Z; Wang, J. Pseudo label rectification with joint camera shift adaptation and outlier progressive recycling for unsupervised person re-identification. IEEE Trans. Intell. Transp. Syst.; 2023; 24,
27. Xie, Z; Zhang, W; Sheng, B; Li, P; Chen, CLP. Bagfn: broad attentive graph fusion network for high-order feature interactions. IEEE Trans. Neural Netw. Learn. Syst.; 2023; 34,
28. Sun, Y., Zheng, L., Yang, Y., Tian, Q., Wang, S.: Beyond part models: person retrieval with refined part pooling. In: European Conference on Computer Vision (2017)
29. Chen, Z; Qiu, G; Li, P; Zhu, L; Yang, X; Sheng, B. MNGNAS: distilling adaptive combination of multiple searched networks for one-shot neural architecture search. IEEE Trans. Pattern Anal. Mach. Intell.; 2023; 45,
30. Lin, X; Sun, S; Huang, W; Sheng, B; Li, P; Feng, DD. EAPT: efficient attention pyramid transformer for image processing. IEEE Trans. Multim.; 2023; 25, pp. 50-61. [DOI: https://dx.doi.org/10.1109/TMM.2021.3120873]
31. Li, J; Chen, J; Sheng, B; Li, P; Yang, P; Feng, DD; Qi, J. Automatic detection and classification system of domestic waste via multimodel cascaded convolutional neural network. IEEE Trans. Ind. Inf.; 2022; 18,
32. Cheng, H., Zhu, Z., Li, X., Gong, Y., Sun, X., Liu, Y.: Learning with instance-dependent label noise: a sample sieve approach. arXiv:2010.02347 (2020)
33. Zhao, C; Lv, X; Zhang, Z; Zuo, W; Wu, J; Miao, D. Deep fusion feature representation learning with hard mining center-triplet loss for person re-identification. IEEE Trans. Multimedia; 2020; 22, pp. 3180-3195. [DOI: https://dx.doi.org/10.1109/TMM.2020.2972125]
34. Ge, Y., Chen, D., Li, H.: Mutual mean-teaching: pseudo label refinery for unsupervised domain adaptation on person re-identification. arXiv:2001.01526 (2020)
35. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929 (2020)
36. He, S., Luo, H., Wang, P., Wang, F., Li, H., Jiang, W.: Transreid: transformer-based object re-identification. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 14993–15002 (2021)
37. Fu, D., Chen, D., Bao, J., Yang, H., Yuan, L., Zhang, L., Li, H., Chen, D.: Unsupervised pre-training for person re-identification. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14745–14754 (2020)
38. Ward, JH. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc.; 1963; 58, pp. 236-244.
39. Feng, Y., Zhao, S., Zhang, Y., Liu, Y., Zhu, S., Coleman, S. A.: Noise-tolerant learning with silhouette coefficient for unsupervised person re-identification. In: 2022 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6 (2022)
40. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.B.: Momentum contrast for unsupervised visual representation learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9726–9735 (2019)
41. Zhong, Z., Zheng, L., Cao, D., Li, S.: Re-ranking person re-identification with k-reciprocal encoding. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3652–3661 (2017)
42. von Luxburg, U. A tutorial on spectral clustering. Stat. Comput.; 2007; 17, pp. 395-416.
43. Zhai, Y., Ye, Q., Lu, S., Jia, M., Ji, R., Tian, Y.: Multiple expert brainstorming for domain adaptive person re-identification. arXiv:2007.01546 (2020)
44. Zheng, K., Liu, W., He, L., Mei, T., Luo, J., Zha, Z.: Group-aware label transfer for domain adaptive person re-identification. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5306–5315 (2021)
45. Li, S; Yuan, M; Chen, J; Hu, Z. Adadc: adaptive deep clustering for unsupervised domain adaptation in person re-identification. IEEE Trans. Circuits Syst. Video Technol.; 2022; 32, pp. 3825-3838. [DOI: https://dx.doi.org/10.1109/TCSVT.2021.3118060]
46. Peng, J; Jiang, G; Wang, H. Adaptive memorization with group labels for unsupervised person re-identification. IEEE Trans. Circuits Syst. Video Technol.; 2023; 33, pp. 5802-5813. [DOI: https://dx.doi.org/10.1109/TCSVT.2023.3258917]
47. Zeng, K., Ning, M., Wang, Y., Guo, Y.: Hierarchical clustering with hard-batch triplet loss for person re-identification. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13654–13662 (2019)
48. Wang, Z., Zhang, J., Zheng, L., Liu, Y., Sun, Y., Li, Y., Wang, S.: Cycas: self-supervised cycle association for learning re-identifiable descriptions. In: European Conference on Computer Vision (2020)
49. Dai, Z., Wang, G., Zhu, S., Yuan, W., Tan, P.: Cluster contrast for unsupervised person re-identification. In: Asian Conference on Computer Vision (2021)
50. Zhang, X., Li, D., Wang, Z., Wang, J., Ding, E., Shi, J. Q., Zhang, Z., Wang, J.: Implicit sample extension for unsupervised person re-identification. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7359–7368 (2022)
51. He, Q., Wang, Z., Zheng, Z., Hu, H.: Spatial and temporal dual-attention for unsupervised person re-identification. In: IEEE Transactions on Intelligent Transportation Systems (2023)
52. Lan, L; Teng, X; Zhang, J; Zhang, X; Tao, D. Learning to purification for unsupervised person re-identification. IEEE Trans. Image Process.; 2022; 32, pp. 3338-3353. [DOI: https://dx.doi.org/10.1109/TIP.2023.3278860]
53. Chen, Z; Cui, Z; Zhang, C; Zhou, J; Liu, Y. Dual clustering co-teaching with consistent sample mining for unsupervised person re-identification. IEEE Trans. Circuits Syst. Video Technol.; 2022; 33, pp. 5908-5920. [DOI: https://dx.doi.org/10.1109/TCSVT.2023.3261898]
54. Zhu, K., Guo, H., Yan, T., Zhu, Y., Wang, J., Tang, M.: Part-aware self-supervised pre-training for person re-identification. In: European Conference on Computer Vision (2022)
55. Luo, H., Wang, P., Xu, Y., Ding, F., Zhou, Y., Wang, F., Li, H., Jin, R.: Self-supervised pre-training for transformer-based person re-identification. arXiv:2111.12084 (2021)
56. Yang, E., Li, C., Liu, S., Liu, Y., Zhao, S., Huang, N.: Self-supervised pre-training with learnable tokenizers for person re-identification in railway stations. In: 2022 16th IEEE International Conference on Signal Processing (ICSP), vol. 1, pp. 325–330 (2022)
57. Tao, Y., Zhang, J., Chen, T., Wang, Y., Zhu, Y.: Transformer-based contrastive learning for unsupervised person re-identification. In: 2022 International Joint Conference on Neural Networks (IJCNN), pp. 1–9 (2022)
58. Ristani, E., Solera, F., Zou, R. S., Cucchiara, R., Tomasi, C.: Performance measures and a data set for multi-target, multi-camera tracking. In: ECCV Workshops (2016)
59. Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., Tian, Q.: Scalable person re-identification: a benchmark. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1116–1124 (2015)
60. Wei, L., Zhang, S., Gao, W., Tian, Q.: Person transfer gan to bridge domain gap for person re-identification. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 79–88 (2017)
61. Xing, E.P., Ng, A., Jordan, M.I., Russell, S.J.: Distance metric learning with application to clustering with side-information. In: NIPS (2002)