Vision-language models trained on large-scale image-text pairs have shown significant potential for representation learning. The human pose estimation task, which is highly sensitive to pixel-wise transformations, requires effective methods for mining pose-specific knowledge. In this paper, we investigate the homologous human pose retrieval task, relying on large-scale annotated datasets to enhance pose knowledge extraction. We propose Pose Prompt (PosePro), which leverages vision-language models to categorize the global pose configuration of an image and to generate pose embeddings as proposals under a compatible design. We then integrate the learned knowledge as visual and textual prompts to facilitate the learning process on previously unseen tasks. We demonstrate the effectiveness of the PosePro model through extensive experiments on both pose retrieval and human pose estimation, showing significant improvements in accuracy and generalization ability, especially in scenarios with limited samples.
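The retrieval step described above can be illustrated with a minimal sketch: pose embeddings (which in PosePro would come from a vision-language encoder; here they are hand-made toy vectors, and the pose-category names are hypothetical) are ranked by cosine similarity against a query image's embedding.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query, gallery):
    """Rank (name, embedding) gallery entries by similarity to the query."""
    scored = [(name, cosine(query, emb)) for name, emb in gallery]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Toy gallery of pose-category embeddings (illustrative values only).
gallery = [
    ("standing", [1.0, 0.1, 0.0]),
    ("sitting",  [0.1, 1.0, 0.2]),
    ("running",  [0.9, 0.4, 0.8]),
]
query = [0.95, 0.2, 0.1]  # embedding of an unseen query image
ranking = retrieve(query, gallery)
print(ranking[0][0])  # → standing
```

In the full method, the top-ranked categories would serve as pose proposals that are fed back as visual and textual prompts rather than used directly as predictions.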
Details
; Xie, Xuemei (2); Fu, Li (1)
1 School of Artificial Intelligence, Xidian University, Xi'an, Shaanxi 710071, PR China
2 Guangzhou Institute of Technology, Xidian University, Guangzhou, Guangdong 510555, PR China