
Abstract

Fine-grained image recognition aims to classify fine subcategories within the same parent category, such as vehicle models or bird species. This is an inherently challenging task because a classifier must capture subtle interclass differences under large intraclass variances. Most previous approaches are based on supervised learning, which requires large-scale labeled datasets. However, such annotated datasets for fine-grained image recognition are difficult to collect because labeling generally requires domain expertise. In this study, we propose a self-supervised transfer learning method based on the Vision Transformer (ViT) to learn finer representations without human annotations. Interestingly, we observe that existing self-supervised learning methods using ViT (e.g., DINO) show poor patch-level semantic consistency, which may be detrimental to learning finer representations. Motivated by this observation, we propose a consistency loss function that encourages the patch embeddings of the overlapping area between two augmented views to be similar to each other during self-supervised learning on fine-grained datasets. In addition, we explore effective transfer learning strategies to fully leverage existing self-supervised models trained on large-scale unlabeled datasets. Contrary to the previous literature, our findings indicate that training only the last block of ViT is effective for self-supervised transfer learning. We demonstrate the effectiveness of the proposed approach through extensive experiments on six fine-grained image classification benchmarks: FGVC Aircraft, CUB-200-2011, Food-101, Oxford 102 Flowers, Stanford Cars, and Stanford Dogs. Under the linear evaluation protocol, our method achieves an average accuracy of 78.5%, outperforming the existing transfer learning method, which yields 77.2%.
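To make the consistency idea concrete, the sketch below shows one plausible way such a patch-level consistency loss could be implemented in PyTorch; it is a minimal illustration, not the authors' released code. It assumes the class token has already been removed from each view's token sequence, and that the indices of the patches covering the overlapping image region (idx_a, idx_b) have been precomputed from the crop geometries of the two augmentations; the function name patch_consistency_loss and the (N, D) embedding layout are illustrative assumptions.

```python
# Minimal sketch of a patch-level consistency loss; illustrative only,
# not the paper's actual implementation.
import torch
import torch.nn.functional as F

def patch_consistency_loss(patches_a: torch.Tensor,
                           patches_b: torch.Tensor,
                           idx_a: torch.Tensor,
                           idx_b: torch.Tensor) -> torch.Tensor:
    """patches_a, patches_b: (N, D) patch embeddings of two augmented views
    (class token removed). idx_a, idx_b: (K,) indices of the K patch pairs
    that cover the same image area in both views, assumed precomputed from
    the two crops' coordinates."""
    za = F.normalize(patches_a[idx_a], dim=-1)  # unit-norm overlap patches, view A
    zb = F.normalize(patches_b[idx_b], dim=-1)  # unit-norm overlap patches, view B
    # Penalize disagreement: 1 - cosine similarity, averaged over matched pairs.
    return (1.0 - (za * zb).sum(dim=-1)).mean()

# Toy usage with random tensors standing in for ViT patch tokens
# (14x14 = 196 patches, 768-dim embeddings for ViT-Base).
a, b = torch.randn(196, 768), torch.randn(196, 768)
overlap = torch.arange(50)  # pretend the first 50 patches of each view overlap
loss = patch_consistency_loss(a, b, overlap, overlap)
```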
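Similarly, the finding that training only the last ViT block is effective amounts to a simple parameter-freezing rule during transfer. The sketch below uses a timm ViT purely for illustration; the model name, the block index, and the choice to also keep the final LayerNorm trainable are assumptions, not details taken from the paper.

```python
# Minimal sketch: train only the last transformer block of a pretrained ViT,
# freezing everything else. Model choice and details are assumptions.
import timm

model = timm.create_model("vit_base_patch16_224", pretrained=True)

for name, param in model.named_parameters():
    # Trainable iff the parameter belongs to the final block (index 11 for
    # ViT-Base) or the final norm; everything else stays frozen.
    param.requires_grad = name.startswith(("blocks.11.", "norm."))

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable tensors, e.g. {trainable[:3]}")
```

Under the linear evaluation protocol referenced in the abstract, the backbone would then be frozen entirely and only a linear classifier would be trained on top of the extracted features.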

Details

Title
Patch-Level Consistency Regularization in Self-Supervised Transfer Learning for Fine-Grained Image Recognition
Author
Lee, Yejin 1; Lee, Suho 1; Hwang, Sangheum 2

1 Department of Data Science, Seoul National University of Science and Technology, Seoul 01811, Republic of Korea; [email protected] (Y.L.); [email protected] (S.L.)
2 Department of Industrial & Information Systems Engineering, Seoul National University of Science and Technology, Seoul 01811, Republic of Korea; Research Center for Electrical and Information Technology, Seoul National University of Science and Technology, Seoul 01811, Republic of Korea
First page
10493
Publication year
2023
Publication date
2023
Publisher
MDPI AG
e-ISSN
2076-3417
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2869235415
Copyright
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.