This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Recent years have witnessed a surge of interest in jointly analyzing multimodal data [1, 2]. As one of the fundamental problems of many multimodal applications, cross-modal retrieval aims to find semantically similar items among objects of different modalities (such as text, images, or audio) [3].
The modality gap is the main challenge of cross-modal retrieval [4, 5]. A common approach to bridge the modality gap is constructing a shared representation space where the multimodal samples can be represented uniformly [2]. However, it is not easy because it requires detailed knowledge of the content of each modality and the correspondence between them [6]. A variety of tools are used to construct the shared space, such as canonical correlation analysis (CCA) [1, 7–10], topic model [11–13], and hashing [14–18]. Among these methods, the deep neural network (DNN) has become the most popular one because of its strong learning ability [6, 19–24]. The performance of most of these methods, especially DNN-based methods, heavily depends on sufficient coupled cross-modal samples [25]. However, collecting coupled training data is labor-intensive and time-consuming.
Although it may not be explicitly stated, two types of relationships are essential considerations when constructing the shared representation space: the intermodal relation and the intramodal relation [5, 10]. They play critical roles in preserving the cross-modal similarity and the single-modal similarity, respectively [5, 10, 25]. Likewise, the separate representation learning and the shared representation learning in some existing works preserve these two relationships, respectively [26, 27].
It should be noted that the information needed to maintain these two relationships is different: the correspondence between cross-modal samples is essential to preserve intermodal relations, while the similarity relation between single-modal samples is indispensable to preserve intramodal relations [10]. Most existing methods, such as [5, 25, 28–30], use only coupled cross-modal samples to preserve both intermodal and intramodal relations, while uncoupled single-modal samples are discarded.
In many cross-domain learning tasks, such as machine translation [31–33], unlabeled samples within a single domain are valuable. As a typical cross-domain learning task, cross-modal retrieval should also benefit from uncoupled training samples. Besides, in contrast to coupled training samples, uncoupled ones are much easier to obtain. Thus, it is necessary to introduce uncoupled training samples into the construction of the shared representation space, especially when coupled ones are insufficient.
Inspired by the discussion above, a two-stage cross-modal retrieval framework is proposed. As illustrated in Figure 1, the proposed Abs-Ass framework uses training samples in a different way. In existing methods, only coupled training samples are used to preserve intramodal and intermodal relations. However, in this framework, both coupled and uncoupled samples are used to maintain intramodal relations; only a few coupled cross-modal sample pairs are used to maintain intermodal relations. Thus, the process of constructing the shared representation space is divided into two subprocesses: Abstraction that preserves intramodal relations and Association that preserves intermodal relations.
[figures omitted; refer to PDF]
The name Abstraction indicates that we need to consider the intramodal relation at the semantic level rather than the feature level. The name Association means that the process of preserving the intermodal relation is exactly finding the correlation between different modalities. Abstraction fully explores intramodal relations through uncoupled samples of each modality, which enables Association to recognize multimodal samples at a higher level; thus, Association can find the correlation between cross-modal samples much more easily, even though only a few coupled training samples are available. In the ideal case, high-level representations of different modalities can be associated even with a linear transformation [34].
Moreover, following the framework above, we propose a cross-modal retrieval method based on the reference-based representation and the correlation between the semantic structures of different modalities. Specifically, Abstraction is implemented by the reference-based representation [35], which represents multimodal objects through the semantic structure. The term semantic structure refers to all pairwise similarities of a set of samples within one modality.
In this paper, we demonstrate the importance of uncoupled samples for preserving intramodal relations and the correlation between the semantic structures of different modalities, which together make cross-modal retrieval with limited coupled training samples possible. The main contributions can be summarized as follows. (1) The Abs-Ass cross-modal retrieval framework. We propose a two-stage framework consisting of Abstraction and Association that emphasizes the different roles of coupled and uncoupled training samples. In contrast to end-to-end learning models, the proposed framework separates the preservation of intramodal and intermodal relations into two stages and uses uncoupled single-modal samples and coupled cross-modal samples to learn them, respectively. Compared with existing methods, the Abs-Ass framework improves the utilization efficiency of training samples and has lower demands for coupled training data. (2) A semantic structure-based cross-modal retrieval method. Following the Abs-Ass framework, we propose a cross-modal retrieval method by introducing the reference-based representation to represent multimodal data at the semantic level and by proving the positive correlation between the semantic structures of different modalities. Although some existing works also try to find the cross-modal correlation from the semantic view [1, 3], the correlation between semantic structures naturally exists and has a fixed and straightforward pattern; therefore, even a few coupled training samples are enough to align the semantic structures of different modalities. Besides, the proposed method is unsupervised because the reference-based representation scheme does not need class labels.
The remainder of this paper is organized as follows. Section 2 introduces the related work on the cross-modal retrieval task. Section 3 presents the proposed implementation of the Abs-Ass framework. Section 4 evaluates the proposed method through experiments on public data sets. Section 5 concludes the paper.
2. Related Work
2.1. CCA-Based Methods
To the best of our knowledge, the first well-known cross-modal correlating model may be the CCA-based model proposed by Hardoon et al. [7]. It learns a linear projection that maximizes the correlation between the representations of different modalities in the projected space. Inspired by this work, many CCA-based models have been designed for cross-modal analysis [1, 8–10, 37]. Rasiwasia et al. [1] utilized CCA to learn two maximally correlated subspaces, and multiclass logistic regression was performed within them to produce the semantic spaces. Mroueh et al. [9] proposed a truncated-SVD-based algorithm to efficiently compute the full regularization path of CCA for multimodal retrieval. Wang et al. [10] developed a hypergraph-based canonical correlation analysis (HCCA) to project low-level features into a shared space where intrapair and interpair correlations are maintained simultaneously. Liang et al. [37] incorporated group correspondence and CCA into cross-modal retrieval.
2.2. Topic Model Methods
The topic model is also helpful for uniformly representing multimodal data, under the assumption that objects of different modalities share some latent topics. Latent Dirichlet allocation- (LDA-) based methods establish the shared space through the joint distribution of multimodal data and the conditional relations between them [11, 12]. Roller and Walde [12] integrated visual features into LDA and presented a multimodal LDA model to learn joint representations for text and visual data. Wang et al. [13] proposed the multimodal mutual topic reinforce model (M3R), which seeks mutually consistent semantic topics across modalities for cross-media retrieval.
2.3. Hashing-Based Methods
With the rapid growth of data volume, the cost of finding nearest neighbors cannot be ignored. Hashing is a scalable method for finding approximate nearest neighbors [14]. It projects data into a Hamming space, where the neighbor search can be performed efficiently. In order to improve the efficiency of finding similar multimodal objects, many cross-modal hashing methods have been proposed [14–18, 38, 39]. Kumar and Udupa [15] proposed a cross-view hashing method to generate hash codes that minimize the Hamming distance between similar objects and maximize that between dissimilar ones. Yi et al. [16] used a coregularization framework to generate binary codes such that the hash codes from different modalities are consistent. Ou et al. [17] constructed a Hamming space for each modality and built the mapping between them with logistic regression. Wu et al. [18] proposed a sparse multimodal hashing method for cross-modal retrieval. Song et al. [38] proposed Self-Supervised Video Hashing (SSVH), which outperforms the state-of-the-art methods on unsupervised video retrieval. Ye and Peng [39] proposed Multiscale Correlation Sequential Cross-modal Hashing Learning (MCSCH) to utilize multiscale features of cross-modal data. Liu et al. [40] proposed Matrix Tri-Factorization Hashing (MTFH), which discards the unified Hamming space to obtain higher representation scalability.
2.4. Deep Learning Methods
Due to the strong learning ability of the deep neural network, many deep models have been proposed for cross-modal learning, such as [6, 19–24, 26, 27, 41, 42]. Ngiam et al. [19] presented an autoencoder model to learn joint representations for speech audio and videos of lip movements. Srivastava and Salakhutdinov [20] employed the restricted Boltzmann machine to learn a shared space between data of different modalities. Frome et al. [22] proposed a deep visual-semantic embedding (DeViSE) model to identify visual objects using information from labeled images and unannotated text. Andrew et al. [21] introduced deep canonical correlation analysis to learn nonlinear mappings between two views of data such that the corresponding objects are linearly related in the representation space. Jiang et al. [23] proposed a real-time Internet cross-media retrieval method in which deep learning was employed for feature extraction and distance detection. Owing to the powerful representation ability of convolutional neural network visual features, Wei et al. [24] coupled them with a deep semantic matching method for cross-modal retrieval. Peng et al. [26, 27] proposed two-stage frameworks to learn the separate representation and the shared representation, implemented by cross-media multiple deep networks (CMDN) and cross-modal correlation learning (CCL), respectively. Song et al. [43] proposed multimodal stochastic RNNs (MS-RNN) for the video captioning task, which addressed a critical deficiency of existing encoder-decoder-based methods. Recently, the attention mechanism has played an increasingly important role in maintaining intermodal and intramodal relations. Qi et al. [41] proposed a visual-language relation attention model to explore the intermodal and intramodal relations between fine-grained patches, as well as cross-media multilevel alignment to boost precise cross-media correlation learning. Gao et al. [42] proposed hierarchical LSTMs with adaptive attention for visual captioning.
Although these methods have achieved great success in multimodal learning, most of them need a mass of training data to learn the complex correlation between objects from different modalities. To reduce the demand for training data, some methods have been proposed from different perspectives. Gao et al. [25] proposed an active similarity learning model for cross-modal data; nevertheless, without extra information, the improvement is limited. Mithun et al. [44] introduced additional web information into cross-modal retrieval.
3. Proposed Approach
The cross-modal retrieval task can be formalized as follows: given a multimodal data set consisting of an image collection and a text collection, where only a limited number of image-text pairs are coupled, the goal is to compute the similarity between any image and any text so that, for a query of one modality, semantically relevant items of the other modality can be retrieved.
The process of our proposed method can be summarized as follows. First, extract the features of images and text through the mappings in equations (1) and (2), and select the multimodal reference set (the red points in Figure 2) with the active learning strategy in Section 3.4. Second, represent the to-be-matched objects (the nonred points in Figure 2) of each modality by their similarities to the reference points of the same modality. Finally, compute the cross-modal similarity from the consistency between the resulting semantic structure-based representations.
[figure omitted; refer to PDF]
As illustrated in Section 3.3, the semantic structures of different modalities are positively correlated; therefore, the cross-modal similarity can be measured directly by the consistency between the semantic structure-based representations.
3.1. Feature Extraction for Text and Images
In early research on cross-modal learning, the limited effectiveness of low-level feature extraction was one of the main factors restricting retrieval accuracy. The application of CNN visual features has significantly improved the accuracy of cross-modal retrieval [4, 24]. In contrast, some works still take the BoW (bag-of-words) model as the default tool to extract text features [5, 29], which is not effective enough to model intramodal relations in the text modality. The consistency of the semantic structure is beneficial to transfer learning tasks, including cross-modal retrieval [45]; thus, we take a pretrained CNN model and sentence embedding with pretrained word vectors for the feature extraction of images and text, respectively.
3.1.1. Pretrained Convolutional Neural Network for Feature Extraction of Images
CNN has demonstrated outstanding performance for various computer vision tasks, such as image classification and object detection. Wei et al. proposed to utilize the pretrained CNN for visual feature extraction in cross-modal retrieval [24], which performs much better than the low-level feature. Because we aim to reduce the dependency on training data, we directly take the pretrained VGG19 [46] (not fine-tuned) to extract the feature of images, namely, the mapping in equation (1).
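To make this step concrete, the following is a minimal sketch of such a fixed feature extractor, assuming PyTorch and torchvision are available; the preprocessing constants and the choice of the 4096-dimensional fc7 output are common conventions for VGG19 rather than details reported in the paper.

```python
# A sketch of the image mapping in equation (1): a pretrained, not fine-tuned VGG19
# used as a fixed feature extractor (assumes PyTorch/torchvision).
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

vgg19 = models.vgg19(weights="IMAGENET1K_V1")   # older torchvision: models.vgg19(pretrained=True)
# Drop the final classification layer so the network outputs the 4096-d fc7 feature.
vgg19.classifier = torch.nn.Sequential(*list(vgg19.classifier.children())[:-1])
vgg19.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def image_feature(path):
    """Return the 4096-d VGG19 feature of one image file."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return vgg19(img).squeeze(0)
```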
3.1.2. Sentence Embedding for Feature Extraction of Text
The advancement of NLP techniques provides powerful tools for text feature extraction. Given enough supervised information, a good end-to-end model can automatically extract the most important features; however, with limited training data, it is hard to train such a model. Instead, we take pretrained word embeddings and an unsupervised text embedding method for the feature extraction of text, which is the mapping in equation (2).
Many text embedding methods for general NLP tasks can be helpful; among them, smooth inverse frequency (SIF) is a simple but powerful sentence embedding method [47]. With pretrained word vectors (such as GloVe [48]), SIF provides a completely unsupervised way to embed sentences into the semantic space, which can be summarized as equations (6) and (7): given a sentence, its word vectors are averaged with the weight a/(a + p(w)), where a is a hyperparameter and p(w) is the estimated frequency of word w; then, the projection of the sentence vectors on their first principal component is removed.
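A minimal sketch of this step is given below, assuming `word_vec` (pretrained GloVe vectors) and `word_freq` (estimated unigram probabilities) are available as plain dictionaries; the weighting constant a = 1e-3 follows the common recommendation of [47], not a value reported in this paper.

```python
# A sketch of SIF sentence embedding (the mapping in equation (2)):
# weighted averaging of word vectors followed by common-component removal.
import numpy as np

def sif_embed(sentences, word_vec, word_freq, a=1e-3):
    dim = len(next(iter(word_vec.values())))
    emb = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        words = [w for w in sent.lower().split() if w in word_vec]
        if not words:
            continue
        weights = np.array([a / (a + word_freq.get(w, 0.0)) for w in words])
        emb[i] = (weights[:, None] * np.array([word_vec[w] for w in words])).mean(axis=0)
    # Remove the projection on the first singular vector (the common component).
    u = np.linalg.svd(emb, full_matrices=False)[2][0]
    return emb - np.outer(emb @ u, u)
```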
3.2. Semantic Structure-Based Representation for Single-Modality Data
Although the extraction tools above provide more accurate features for image and text data, it is still hard to directly perform retrieval on them, especially when only limited coupled training samples are available. Therefore, the feature-level representations are further transformed into semantic structure-based representations that preserve intramodal relations at the semantic level.
In the unsupervised setting, the semantic structure-based representation (also named space structure-based representation, SSR) is a simple but effective way to preserve intramodal relations, as some unsupervised learning methods did [49, 50]. Given a set of samples, each sample is represented by the vector of its cosine similarities to the samples in the set, as formalized in equation (8); in this way, the representation directly encodes the semantic structure of the modality.
The reason for choosing cosine similarity lies in two aspects: on the one hand, the cosine similarity is normalized, and its value range is always [−1, 1], which makes the similarities comparable across samples and modalities; on the other hand, it is insensitive to the magnitude of the feature vectors, which may differ greatly between the extraction tools of different modalities.
In equation (8), the dimensionality of the representation equals the size of the data set, which is very high for large data sets and leads to high computational complexity in both the representation and the follow-up task. Given a data set, a small reference set is therefore selected, and each sample is represented only by its similarities to the reference points, which reduces the dimensionality to the size of the reference set [35].
[figure omitted; refer to PDF]
The reference set should cover the semantic structure of the data set as completely as possible; therefore, the reference points are generated by clustering, and the resulting cluster centers are taken as the reference points.
[figure omitted; refer to PDF]
The clustering method should generate cluster centers that are real samples of the data set because the reference points must be actual samples. However, many popular clustering methods, such as k-means, can only generate cluster centers in the form of prototypes; thus, a clustering method whose centers are real samples, such as k-medoids [51], is adopted here.
In the reference-based representation, each object is finally described by its cosine similarities to the selected reference points, so the dimensionality of the representation equals the number of reference points.
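The following sketch illustrates the reference-based representation for one modality. Because k-medoids implementations vary, the sketch simply takes the real sample nearest to each k-means center as a reference point; this stands in for the clustering step described above and is an assumption, not the authors' exact implementation.

```python
# A sketch of reference selection and the reference-based (semantic structure)
# representation within one modality.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def select_reference_points(features: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Cluster the features and return the indices of k real samples used as reference points."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(features)
    ref_idx = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dist = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        ref_idx.append(members[np.argmin(dist)])   # a real sample, as the method requires
    return np.array(ref_idx)

def reference_based_representation(features: np.ndarray, ref_idx: np.ndarray) -> np.ndarray:
    """Each row holds one sample's cosine similarities to the k reference points."""
    return cosine_similarity(features, features[ref_idx])
```

The resulting matrix has one row per sample and one column per reference point, so the dimensionality of the representation is controlled entirely by the number of reference points.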
3.3. Cross-Modal Similarity Computing Based on the Correlation between Semantic Structures
The semantic structure-based representation provides different modalities with a homogeneous representation scheme. Moreover, if the reference sets of different modalities are coupled, i.e., the reference points correspond one to one across modalities, the representations of different modalities become directly comparable.
This section proves that, assuming similar cross-modal samples can be correlated through a linear transformation, the semantic structures (i.e., the pairwise cosine similarities) of different modalities are positively correlated.
Although the assumption seems reasonable intuitively, it is hard to prove completely because the definition of the cross-modal similarity relation cannot be defined uniformly at the feature level. For simplicity, we discuss the case that similar cross-modal samples can be correlated through a linear transformation. The nonlinear case is not discussed because nonlinear mapping functions have much more complex and various forms; thus, it is difficult to discuss the nonlinear case comprehensively in a limited space. Besides, existing works [34, 36] have proved that nonlinear mapping functions have no obvious advantage over linear mapping in correlating cross-modal samples.
Following existing works [1, 25], we assume that similar cross-modal samples are correlated through a linear transformation; that is, the feature vector of a sample in one modality is mapped to the feature vector of its similar counterpart in the other modality by a nonzero mapping matrix.
In this way, we have the following proposition:
Proposition 1.
If the similar samples in the two modalities are correlated through a nonzero linear transformation, then the cosine similarity between two samples in one modality and the cosine similarity between their corresponding samples in the other modality are positively correlated.
Proof.
We assume that two samples of one modality are mapped by this linear transformation to their corresponding samples in the other modality, and we consider the cosine similarity computed within each modality.
The Pearson correlation coefficient is used to measure the correlation between the two cosine similarities, as given in equation (13).
Step 1.
Proving the denominator of equation (13) is positive.
Because neither of the two cosine similarities is constant, their standard deviations are both positive; hence, the denominator of equation (13) is positive.
Step 2.
Factorizing the numerator of equation (13) by diagonalizing the symmetric matrix induced by the mapping matrix.
The covariance between the two cosine similarities, which forms the numerator, is expanded in equation (15).
Because the diagonalization provides an orthonormal basis of eigenvectors, the expression can be simplified, as shown in equation (17).
Substituting equation (17) into equation (15), the covariance can be rewritten as in equation (19).
Step 3.
Computing the covariance through case-by-case discussion of equation (19).
Because the value of each term in equation (19) depends on which indices it involves, three cases are distinguished, corresponding to equations (20)–(22). From equations (20)–(22), we know that the covariance in equation (19) is nonzero in only one of the three cases.
Step 4.
Proving the covariance is positive.
Because the mapping matrix is nonzero, the symmetric matrix obtained from it is nonzero as well. The sum of its eigenvalues equals the sum of its diagonal elements, which is positive.
Equations (24) and (25) show that there exists at least one positive eigenvalue.
Because the eigenvector associated with a positive eigenvalue is nonzero, the corresponding term of the covariance is positive.
Finally, from equations (26) and (27), the covariance between the two cosine similarities is positive.
Step 5.
Proving the Pearson coefficient is positive.
From equations (14) and (28), the Pearson correlation coefficient is larger than zero.
In conclusion, for any nonzero linear transformation between the two modalities, the cosine similarities computed within each modality are positively correlated, which completes the proof.
In the proposition and its proof, apart from being nonzero, we have no requirement for the mapping matrix; thus, the positive correlation between semantic structures holds for a broad class of cross-modal correlations.
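The claim can also be checked numerically. The snippet below is an illustration only, not part of the original analysis: it applies an arbitrary nonzero random linear mapping and measures the Pearson correlation between the cosine similarities computed before and after the mapping.

```python
# A quick numerical sanity check of Proposition 1 (illustrative, not part of the proof):
# cosine similarities before and after a random nonzero linear mapping are positively correlated.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))     # features of one modality
W = rng.normal(size=(64, 32))      # an arbitrary nonzero linear mapping
Y = X @ W                          # "coupled" features of the other modality

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

pairs = rng.integers(0, len(X), size=(2000, 2))
s_x = np.array([cos(X[i], X[j]) for i, j in pairs])
s_y = np.array([cos(Y[i], Y[j]) for i, j in pairs])
print(np.corrcoef(s_x, s_y)[0, 1])  # Pearson coefficient, expected to be clearly > 0
```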
From Proposition 1, it can be inferred that samples that are similar within one modality correspond to samples that are similar within the other modality; in other words, the semantic structures of the two modalities are consistent.
Because reference points are also real samples of the data set, this conclusion equally applies to the similarities between any sample and the reference points.
Thus, if all the reference points of the two modalities are coupled, the semantic structure-based representations of coupled cross-modal samples are positively correlated dimension by dimension.
Moreover, we have a direct way to measure the cross-modal similarity: the more consistent the semantic structure-based representations of an image and a text are, the more likely they are to be semantically relevant.
Although the analysis above is somewhat lengthy, the cross-modal similarity computation based on it is quite simple. The core of the computation is a multimodal reference set whose reference points are coupled across modalities: each object is represented by its similarities to the reference points of its own modality, and the cross-modal similarity is measured by the consistency between these representations.
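A sketch of this computation is shown below. The Pearson correlation between the two representation vectors is used as the consistency measure; this matches the analysis above, but the exact formula used in the paper is not reproduced in this extract, so treat the choice as an assumption.

```python
# A sketch of the cross-modal similarity computation: with coupled reference points,
# the similarity between an image and a text is the consistency (here, row-wise
# Pearson correlation) of their semantic structure-based representations.
import numpy as np

def cross_modal_similarity(img_repr: np.ndarray, txt_repr: np.ndarray) -> np.ndarray:
    """img_repr: (m, k) image SSR; txt_repr: (n, k) text SSR; returns an (m, n) similarity matrix."""
    a = img_repr - img_repr.mean(axis=1, keepdims=True)
    b = txt_repr - txt_repr.mean(axis=1, keepdims=True)
    a /= np.linalg.norm(a, axis=1, keepdims=True) + 1e-12
    b /= np.linalg.norm(b, axis=1, keepdims=True) + 1e-12
    return a @ b.T
```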
3.4. Multimodal Reference Selection Based on Active Learning
The multimodal reference set consists of coupled reference points from both modalities; its quality largely determines how well the semantic structures of the two modalities can be aligned and, consequently, the retrieval accuracy.
In this section, we design an active learning-based strategy for the selection of the multimodal reference set
From the analysis in Section 3.3, there exists a positive correlation between the semantic structures of different modalities. Hence, the neighbor structures of different modalities should be similar. Therefore, if the reference points capture the semantic structure of one modality well, their coupled counterparts are also good reference points for the other modality.
Thus, we select the reference points for one modality as in Section 3.2; then, the coupled samples in the other modality, obtained by asking the oracle, are used as the reference points of that modality. The choice of which modality to use as the basis of selection also matters: the modality with a clearer group structure is recommended because it brings better performance. The cost of matching should also be considered; for example, the costs of querying images from text and querying text from images may differ.
Finally, combining the similarity computing method in Section 3.3 with the reference selecting method above, we propose the semantic structure matching with the active learning (SSM-AL) method in Algorithm 1.
Algorithm 1: SSM-AL.
Require: Two data sets of different modalities
Ensure: Cross-modal similarity matrix
(1)
Divide
(2)
(3)
for all
(4)
(5)
end for
(6)
(7)
(8)
for all
(9)
(10)
end for
First, the multimodal reference set is selected: the samples of one modality are clustered, the cluster centers (real samples) are taken as the reference points of that modality, and their coupled samples in the other modality are obtained by querying the oracle. Then, every sample of each modality is represented by its cosine similarities to the reference points of its own modality. Finally, the cross-modal similarity matrix is computed from the consistency between these representations.
The computational complexity of SSM-AL is analyzed as follows. Given the retrieval problem between two modalities, the dominant costs are the clustering used for reference selection and the computation of every sample's similarities to the reference points; with a fixed number of reference points, the representation and the similarity computation scale linearly with the number of samples.
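Putting the pieces together, the following sketch outlines the SSM-AL pipeline using the helper functions from the sketches in Sections 3.2 and 3.3; `ask_oracle` is a hypothetical stand-in for the human matching step that returns the index of the coupled image for a selected text reference point.

```python
# An end-to-end sketch of SSM-AL under the assumptions of the previous sketches.
# select_reference_points, reference_based_representation, and cross_modal_similarity
# are the helpers defined above; ask_oracle is hypothetical.
import numpy as np

def ssm_al(img_feats: np.ndarray, txt_feats: np.ndarray, k: int, ask_oracle) -> np.ndarray:
    # 1) Select k reference points by clustering one modality (text is used here).
    txt_ref = select_reference_points(txt_feats, k)
    # 2) Active learning step: query the oracle for the coupled image of each text
    #    reference point, which yields the coupled multimodal reference set.
    img_ref = np.array([ask_oracle(t) for t in txt_ref])
    # 3) Abstraction: semantic structure-based representation within each modality.
    txt_repr = reference_based_representation(txt_feats, txt_ref)
    img_repr = reference_based_representation(img_feats, img_ref)
    # 4) Association: cross-modal similarity from the consistency of the representations.
    return cross_modal_similarity(img_repr, txt_repr)
```

Note that the oracle is queried only for the k reference points, so the number of coupled pairs required equals the number of reference points, which is the point of the active learning strategy.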
4. Experiments
In this section, we perform some experiments to evaluate the performance of the proposed method.
4.1. Data Sets
We use four benchmark data sets to evaluate the performance of the proposed method: Pascal-Sentences [55], Wikipedia [1], XMedia [56], and MSCOCO [57].
Pascal-Sentences: a subset of Pascal VOC, which contains 1,000 pairs of images and the corresponding text description from twenty categories.
Wikipedia: a data set containing 2,866 pairs of images and text from ten categories. Each pair of image and text is extracted from Wikipedia’s articles [1].
XMedia: a publicly available data set consisting of five media types (text, image, video, audio, and 3D model). We only use the image and text data in this paper, i.e., 5,000 pairs of images and text from twenty categories.
MSCOCO: a large data set containing 123,287 images and their annotated sentences. Each image is annotated by five independent sentences.
Following existing works [24, 30], we take 20% of the samples as the testing set for Wikipedia, Pascal-Sentences, and XMedia. The testing set of MSCOCO is split as in [58, 59]. The training sets are kept small because we aim to test the performance with insufficient training samples.
4.2. Evaluation Protocol
We compare the retrieval performance of the proposed method with eight baselines:
CCA [1]: with canonical correlation analysis (CCA), a shared space is learned for different modalities where they are maximally correlated.
HSNN [60]: the heterogeneous similarity is measured by the probability of two cross-modal objects belonging to the same semantic category, which is achieved by analyzing the homogeneous nearest neighbors of each object.
JRL [56]: through semisupervised regularization and sparse regularization, JRL learns a common space using semantic information.
JFSSL [61]: a multimodal graph regularization is used to preserve the intermodality and intramodality similarity relationships.
CMCP [62]: a novel cross-modal correlation propagation method considering both positive relation and negative relation between cross-modal objects.
JGRHML [63]: a joint graph-regularized heterogeneous metric learning method, which integrates the structure of different modalities into a joint graph regularization.
VSEPP [59]: a visual-semantic embedding learning technique for cross-modal retrieval, which introduces a simple change to the common loss functions used for multimodal embeddings.
GXN [34]: a cross-modal feature embedding method that incorporates generative processes, which can well match images and sentences with complex content.
SSM-AL: the proposed method is evaluated in two settings, which differ in the modality used for reference point selection: one clusters the text data and the other clusters the image data. The two settings are reported as SSM-AL (setting 1) and SSM-AL (setting 2) in the tables below.
Among these methods, CCA, VSEPP, GXN, and our proposed SSM-AL are unsupervised methods that do not use class labels at all; HSNN, JFSSL, and CMCP are supervised methods where class labels are necessary; and JRL and JGRHML are semisupervised methods that need the class labels of some samples.
For Pascal-Sentences, Wikipedia, and XMedia, a query item and a target item are considered actually similar if they share the same class label [30]. Mean average precision (MAP) is used to evaluate the performance on these data sets; it is a widely used metric in information retrieval [64], computed as the mean over all queries of the average precision (AP), where the AP of a query averages the precision values at the ranks of all relevant items in the retrieved list.
In contrast to the other three data sets, MSCOCO has no definite class labels. Following [28, 30], we consider a query and a target to be actually similar only if they form a coupled image-text pair in the data set, and we take the Recall@K score, i.e., the fraction of queries whose ground-truth item appears in the top K retrieved results, as the evaluation metric.
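For clarity, the two evaluation protocols can be sketched as follows; `sim` is a query-by-target cross-modal similarity matrix, `q_labels`/`t_labels` are class labels, and `gt_index` gives the index of the coupled target for each query. These are generic implementations of the standard metrics, not the authors' evaluation code.

```python
# Sketches of the two evaluation protocols: MAP with class labels
# (Pascal-Sentences, Wikipedia, XMedia) and Recall@K with coupled pairs (MSCOCO).
import numpy as np

def mean_average_precision(sim: np.ndarray, q_labels: np.ndarray, t_labels: np.ndarray) -> float:
    aps = []
    for i, row in enumerate(sim):
        order = np.argsort(-row)                               # rank targets by similarity
        rel = (t_labels[order] == q_labels[i]).astype(float)   # 1 where the class label matches
        if rel.sum() == 0:
            continue
        precision_at_hits = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append((precision_at_hits * rel).sum() / rel.sum())
    return float(np.mean(aps))

def recall_at_k(sim: np.ndarray, gt_index: np.ndarray, k: int = 5) -> float:
    topk = np.argsort(-sim, axis=1)[:, :k]                     # indices of the top-k targets
    hits = [gt_index[i] in topk[i] for i in range(len(sim))]
    return float(np.mean(hits))
```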
4.3. Retrieval Performance Comparisons
In this section, we compare the bidirectional (image-to-text and text-to-image) retrieval performance of SSM-AL and the baselines. The number of coupled cross-modal training samples, which is also the number of clusters and reference points in SSM-AL, is varied across the experiments.
MAP values of some methods with small training sets are not reported in Tables 1–3 because these methods cannot finish training with such limited data; these cases are marked with “—.” Besides, to evaluate the impact of the number of reference points on the retrieval performance, we plot the MAP values against the number of reference points in Figures 5–7.
(1)
The result on Pascal-Sentences: in Table 1, the proposed method achieves the highest MAP values in both retrieval directions under all three sizes of the training set, and its advantage over the baselines is especially obvious when only 10 or 50 coupled pairs are available.
From Figure 5, in general, more reference points bring higher performance in both retrieval tasks. The increase is rapid when the number of reference points is small and gradually slows down as more reference points are added.
(2)
The result on Wikipedia: in Table 2, the MAP values of the proposed method are the highest in both retrieval directions for all training set sizes, although the margin over the best supervised baseline becomes smaller when 100 coupled pairs are available.
In Figure 6, the MAP values of the proposed method also increase with the number of reference points on both retrieval tasks, and the growth gradually levels off.
(3)
The result on XMedia: in Table 3, the proposed method outperforms all baselines by a large margin under every training set size, and the gap is especially large on the text-to-image task.
In Figure 7, the MAP value of the proposed method again grows with the number of reference points, and the trend is consistent with that on the other two data sets.
(4)
The result on MSCOCO: Figure 8 compares the Recall@K scores of the proposed method and the baselines on MSCOCO.
Table 1
MAP of the bidirectional retrieval task on Pascal-Sentences.

Coupled training pairs | Task | CCA | SSM-AL (setting 1) | SSM-AL (setting 2) | JRL | JFSSL | CMCP | JGRHML | HSNN | VSEPP | GXN
---|---|---|---|---|---|---|---|---|---|---|---
10 | Image to text | 0.1275 | 0.2161 | 0.2263 | — | — | 0.1372 | — | — | 0.0875 | 0.0769
10 | Text to image | 0.1275 | 0.2005 | 0.2043 | — | — | 0.1193 | — | — | 0.0814 | 0.0799
50 | Image to text | 0.1252 | 0.3207 | 0.3484 | 0.1275 | 0.2103 | 0.1485 | 0.1275 | 0.1271 | 0.1215 | 0.1186
50 | Text to image | 0.0831 | 0.2999 | 0.3174 | 0.1275 | 0.2169 | 0.1359 | 0.1275 | 0.1636 | 0.1207 | 0.1148
100 | Image to text | 0.0899 | 0.3354 | 0.3764 | 0.1275 | 0.1010 | 0.3441 | 0.2407 | 0.2625 | 0.1331 | 0.1429
100 | Text to image | 0.0704 | 0.3189 | 0.3482 | 0.1275 | 0.1058 | 0.2981 | 0.2932 | 0.2724 | 0.1322 | 0.1455
Table 2
MAP of the bidirectional retrieval task on Wikipedia.

Coupled training pairs | Task | CCA | SSM-AL (setting 1) | SSM-AL (setting 2) | JRL | JFSSL | CMCP | JGRHML | HSNN | VSEPP | GXN
---|---|---|---|---|---|---|---|---|---|---|---
10 | Image to text | 0.1215 | 0.2103 | 0.1868 | — | — | 0.1761 | — | — | 0.1223 | 0.1201
10 | Text to image | 0.1215 | 0.1848 | 0.1703 | — | — | 0.1222 | — | — | 0.1212 | 0.1208
50 | Image to text | 0.1678 | 0.2575 | 0.2268 | 0.2092 | 0.1163 | 0.1904 | 0.1993 | 0.2084 | 0.1238 | 0.1229
50 | Text to image | 0.1138 | 0.2273 | 0.2051 | 0.2092 | 0.1095 | 0.1506 | 0.1612 | 0.1438 | 0.1266 | 0.1259
100 | Image to text | 0.1222 | 0.2621 | 0.2243 | 0.2092 | 0.1121 | 0.2592 | 0.2153 | 0.2321 | 0.1275 | 0.1270
100 | Text to image | 0.1059 | 0.2438 | 0.2197 | 0.2092 | 0.1098 | 0.2063 | 0.1767 | 0.1817 | 0.1285 | 0.1289
Table 3
MAP of the bidirectional retrieval task on XMedia.

Coupled training pairs | Task | CCA | SSM-AL (setting 1) | SSM-AL (setting 2) | JRL | JFSSL | CMCP | JGRHML | HSNN | VSEPP | GXN
---|---|---|---|---|---|---|---|---|---|---|---
40 | Image to text | 0.0569 | 0.2827 | 0.2781 | — | — | 0.2317 | 0.0569 | 0.0569 | 0.0691 | 0.0689
40 | Text to image | 0.0569 | 0.3485 | 0.3350 | — | — | 0.2590 | 0.0569 | 0.0737 | 0.0510 | 0.0501
200 | Image to text | 0.0566 | 0.6454 | 0.6421 | 0.0572 | 0.1297 | 0.2403 | 0.2018 | 0.0800 | 0.0683 | 0.0644
200 | Text to image | 0.0568 | 0.7365 | 0.7355 | 0.0572 | 0.4161 | 0.4139 | 0.3345 | 0.1161 | 0.0505 | 0.0517
400 | Image to text | 0.0574 | 0.6902 | 0.6729 | 0.4573 | 0.0669 | 0.2971 | 0.1410 | 0.1801 | 0.0663 | 0.0671
400 | Text to image | 0.0580 | 0.7775 | 0.7629 | 0.4869 | 0.0832 | 0.5038 | 0.1626 | 0.2101 | 0.0481 | 0.0489
[figures omitted; refer to PDF]
[figures omitted; refer to PDF]
[figures omitted; refer to PDF]
In Figure 9, the Recall@K scores on MSCOCO are reported under different numbers of reference points.
[figures omitted; refer to PDF]
[figures omitted; refer to PDF]
It can be concluded that the proposed SSM-AL method achieves competitive or better retrieval performance than the baselines while requiring far fewer coupled training samples.
Overall, more reference points are beneficial to the retrieval performance of SSM-AL: in most cases, the performance improves as reference points are added. However, more reference points are not always better: on the one hand, more reference points mean a higher cost for matching cross-modal samples; on the other hand, the performance gain becomes limited when the number of reference points is already high. In practice, the number of reference points should be decided according to the performance demand and the matching cost.
The above results also show that the retrieval performance differs between the two reference-selection settings, i.e., when different modalities are used as the basis of reference point selection. In general, selecting the reference points from the modality with the clearer group structure performs better, which is consistent with the recommendation in Section 3.4.
5. Conclusion
In this paper, we try to improve the performance of cross-modal retrieval when training data are insufficient. Different from existing works, the proposed framework and its implementation emphasize learning intramodal relations from the data itself; no additional information (such as class labels or annotations from the web) is used as a supplement. This property is especially valuable when coupled training samples are insufficient, so the method can be helpful when the application cost is an essential consideration. It can also be incorporated into other methods to alleviate the cold-start problem of the cross-modal retrieval task.
Future work lies in two directions. On the one hand, this work can be improved by incorporating the class labels of a few samples when aligning the semantic structures of different modalities. On the other hand, we plan to extend this work to other modalities, such as video and audio.
Acknowledgments
This work was partially funded by the National Natural Science Foundation of China (nos. 91648204 and 61532007), the National Key Research and Development Program of China (nos. 2017YFB1001900 and 2017YFB1301104), the National Science Foundation for Young Scientists of China (Grant no. 61802426), and the National Science and Technology Major Project.
[1] N. Rasiwasia, J. C. Pereira, E. Coviello, "A new approach to cross-modal multimedia retrieval," Proceedings of the International Conference on Multimedia, pp. 251-260, DOI: 10.1145/1873951.1873987, .
[2] Y. Peng, X. Huang, Y. Zhao, "An overview of cross-media retrieval: concepts, methodologies, benchmarks, and challenges," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28 no. 9, pp. 2372-2385, DOI: 10.1109/tcsvt.2017.2705068, 2018.
[3] P. J. Costa, E. Coviello, G. Doyle, "On the role of correlation and abstraction in cross-modal multimedia retrieval," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 36 no. 3, pp. 521-535, DOI: 10.1109/TPAMI.2013.142, 2014.
[4] Y.-x. Peng, W.-w. Zhu, Y. Zhao, "Cross-media analysis and reasoning: advances and directions," Frontiers of Information Technology & Electronic Engineering, vol. 18 no. 1, pp. 44-57, DOI: 10.1631/fitee.1601787, 2017.
[5] B. Wang, Y. Yang, X. Xu, A. Hanjalic, H. T. Shen, "Adversarial cross-modal retrieval," pp. 154-162, .
[6] A. Karpathy, A. Joulin, F. F. Li, "Deep fragment embeddings for bidirectional image sentence mapping," Proceedings of the Advances in Neural Information Processing Systems, pp. 1889-1897, .
[7] D. R. Hardoon, S. Szedmak, J. Shawe-Taylor, "Canonical correlation analysis: an overview with application to learning methods," Neural Computation, vol. 16 no. 12, pp. 2639-2664, DOI: 10.1162/0899766042321814, 2004.
[8] A. Sharma, A. Kumar, H. Daume, D. W. Jacobs, "Generalized multiview analysis: a discriminative latent space," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2160-2167, DOI: 10.1109/cvpr.2012.6247923, .
[9] Y. Mroueh, E. Marcheret, V. Goel, "Multimodal retrieval with asymmetrically weighted truncated-svd canonical correlation analysis," 2015. http://arxiv.org/abs/1511.06267
[10] L. Wang, W. Sun, Z. Zhao, F. Su, "Modeling intra- and inter-pair correlation via heterogeneous high-order preserving for cross-modal retrieval," Signal Processing, vol. 131, pp. 249-260, DOI: 10.1016/j.sigpro.2016.08.012, 2017.
[11] Y. Jia, M. Salzmann, T. Darrell, "Learning cross-modality similarity for multinomial data," Proceedings of the International Conference on Computer Vision, pp. 2407-2414, DOI: 10.1109/iccv.2011.6126524, .
[12] S. Roller, S. S. I. Walde, "A multimodal lda model integrating textual, cognitive and visual modalities," Proceedings of the Conference on Empirical Methods In Natural Language Processing, pp. 1146-1157, .
[13] Y. Wang, F. Wu, J. Song, X. Li, Y. Zhuang, "Multi-modal mutual topic reinforce modeling for cross-media retrieval," Proceedings of the ACM International Conference on Multimedia, pp. 307-316, DOI: 10.1145/2647868.2654901, .
[14] J. Wang, S. Kumar, S. Chang, "Sequential projection learning for hashing with compact codes," Proceedings of the International Conference on Machine Learning, pp. 1127-1134, .
[15] S. Kumar, R. Udupa, "Learning hash functions for cross-view similarity search," Proceedings of the International Joint Conference on Artificial Intelligence, pp. 1360-1365, .
[16] Z. Yi, D. Y. Yeung, "Co-regularized hashing for multimodal data," Proceedings of the Advances in Neural Information Processing Systems, pp. 1376-1384, .
[17] M. Ou, P. Cui, F. Wang, J. Wang, W. Zhu, S. Yang, "Comparing apples to oranges: a scalable solution with heterogeneous hashing," Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 230-238, DOI: 10.1145/2487575.2487668, .
[18] F. Wu, Z. Yu, Y. Yang, S. Tang, Y. Zhang, Y. Zhuang, "Sparse multi-modal hashing," IEEE Transactions on Multimedia, vol. 16 no. 2, pp. 427-439, DOI: 10.1109/tmm.2013.2291214, 2014.
[19] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A. Y. Ng, "Multimodal deep learning," Proceedings of the International Conference on Machine Learning, pp. 689-696, .
[20] N. Srivastava, R. Salakhutdinov, "Multimodal learning with deep Boltzmann machines," Proceedings of the Advances in Neural Information Processing Systems, pp. 2222-2230, .
[21] G. Andrew, R. Arora, J. A. Bilmes, K. Livescu, "Deep canonical correlation analysis," Proceedings of the International Conference on Machine Learning, pp. 1247-1255, .
[22] A. Frome, G. S. Corrado, J. Shlens, "A deep visual-semantic embedding model," Proceedings of the Advances in Neural Information Processing Systems, pp. 2121-2129, .
[23] B. Jiang, J. Yang, Z. Lv, K. Tian, Q. Meng, Y. Yan, "Internet cross-media retrieval based on deep learning," Journal of Visual Communication and Image Representation, vol. 48, pp. 356-366, DOI: 10.1016/j.jvcir.2017.02.011, 2017.
[24] Y. Wei, Y. Zhao, C. Lu, "Cross-modal retrieval with cnn visual features: a new baseline," IEEE Transactions on Cybernetics, vol. 47 no. 47, pp. 449-460, DOI: 10.1109/TCYB.2016.2519449, 2017.
[25] N. Gao, S.-J. Huang, Y. Yan, S. Chen, "Cross modal similarity learning with active queries," Pattern Recognition, vol. 75, pp. 214-222, DOI: 10.1016/j.patcog.2017.05.011, 2018.
[26] Y. Peng, X. Huang, J. Qi, "Cross-media shared representation by hierarchical learning with multiple deep networks," Proceedings of the International Joint Conference on Artificial Intelligence, pp. 3846-3853, .
[27] Y. Peng, J. Qi, X. Huang, Y. Yuan, "CCL: cross-modal correlation learning with multigrained fusion by hierarchical network," IEEE Transactions on Multimedia, vol. 20 no. 2, pp. 405-420, DOI: 10.1109/tmm.2017.2742704, 2018.
[28] F. Yan, K. Mikolajczyk, "Deep correlation for matching images and text," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3441-3450, DOI: 10.1109/cvpr.2015.7298966, .
[29] Y. Zhan, J. Yu, Z. Yu, R. Zhang, D. Tao, Q. Tian, "Comprehensive distance-preserving autoencoders for cross-modal retrieval," pp. 1137-1145, DOI: 10.1145/3240508.3240607, .
[30] Y. Peng, J. Qi, Y. Yuan, "Modality-specific cross-modal similarity measurement with recurrent attention network," IEEE Transactions on Image Processing, vol. 27 no. 11, pp. 5585-5599, DOI: 10.1109/tip.2018.2852503, 2018.
[31] A. Klementiev, I. Titov, B. Bhattarai, "Inducing cross lingual distributed representations of words," Proceedings of the International Conference on Computational Linguistics, pp. 1459-1474, .
[32] T. Mikolov, Q. V. Le, I. Sutskever, "Exploiting similarities among languages for machine translation," 2013. http://arxiv.org/abs/1309.4168
[33] S. Gouws, Y. Bengio, G. Corrado, "Bilbowa: fast bilingual distributed representations without word alignments," Proceedings of the International Conference on Machine Learning, pp. 748-756, .
[34] J. Gu, J. Cai, S. R. Joty, L. Niu, G. Wang, "Look, imagine and match: improving textual-visual cross-modal retrieval with generative models," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7181-7189, DOI: 10.1109/cvpr.2018.00750, .
[35] Q. Zheng, X. Diao, J. Cao, "From whole to part: reference-based representation for clustering categorical data," IEEE Transactions on Neural Networks and Learning Systems, vol. 31 no. 3, pp. 927-937, DOI: 10.1109/tnnls.2019.2911118, 2020.
[36] G. Collell, M.-F. Moens, "Do neural network cross-modal mappings really bridge modalities?," Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), DOI: 10.18653/v1/p18-2074, .
[37] J. Liang, R. He, Z. Sun, T. Tan, "Group-invariant cross-modal subspace learning," Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 1739-1745, .
[38] J. Song, H. Zhang, X. Li, L. Gao, M. Wang, R. Hong, "Self-supervised video hashing with hierarchical binary auto-encoder," IEEE Transactions on Image Processing, vol. 27 no. 7, pp. 3210-3221, DOI: 10.1109/tip.2018.2814344, 2018.
[39] Z. Ye, Y. Peng, "Multi-scale correlation for sequential cross-modal hashing learning," pp. 852-860, .
[40] X. Liu, Z. Hu, H. Ling, Y.-m. Cheung, "MTFH: a matrix tri-factorization hashing framework for efficient cross-modal retrieval," 2018. http://arxiv.org/abs/1805.01963
[41] J. Qi, Y. Peng, Y. Yuan, "Cross-media multi-level alignment with relation attention network," pp. 892-898, DOI: 10.24963/ijcai.2018/124, .
[42] L. Gao, X. Li, J. Song, H. T. Shen, "Hierarchical LSTMS with adaptive attention for visual captioning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42 no. 5, pp. 1112-1131, DOI: 10.1109/tpami.2019.2894139, 2019.
[43] J. Song, Y. Guo, L. Gao, X. Li, A. Hanjalic, H. T. Shen, "From deterministic to generative: multimodal stochastic rnns for video captioning," IEEE Transactions on Neural Networks and Learning Systems, vol. 30 no. 10, pp. 3047-3058, DOI: 10.1109/tnnls.2018.2851077, 2019.
[44] N. C. Mithun, R. Panda, E. E. Papalexakis, A. K. Roy-Chowdhury, "Webly supervised joint embedding for cross-modal image-text retrieval," Proceedings of the ACM International Conference on Multimedia,DOI: 10.1145/3240508.3240712, .
[45] Y. Li, D. Wang, H. Hu, Y. Lin, Y. Zhuang, "Zero-shot recognition using dual visual-semantic mapping paths," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5207-5215, DOI: 10.1109/cvpr.2017.553, .
[46] K. Simonyan, A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014. http://arxiv.org/abs/1409.1556
[47] S. Arora, Y. Liang, T. Ma, "A simple but tough-to-beat baseline for sentence embeddings," Proceedings of the International Conference on Learning Representations, .
[48] J. Pennington, R. Socher, C. Manning, "Glove: global vectors for word representation," Proceedings of the Conference on Empirical Methods in Natural Language Processing,DOI: 10.3115/v1/d14-1162, .
[49] Y. Qian, F. Li, J. Liang, B. Liu, C. Dang, "Space structure and clustering of categorical data," IEEE Transactions on Neural Networks and Learning Systems, vol. 27 no. 10, pp. 2047-2059, DOI: 10.1109/tnnls.2015.2451151, 2016.
[50] A. Y. Ng, M. I. Jordan, Y. Weiss, "On spectral clustering: analysis and an algorithm," Proceedings of the Advances in Neural Information Processing Systems, pp. 849-856, .
[51] H.-S. Park, C.-H. Jun, "A simple and fast algorithm for k-medoids clustering," Expert Systems with Applications, vol. 36 no. 2, pp. 3336-3341, DOI: 10.1016/j.eswa.2008.01.039, 2009.
[52] A. McCallum, K. Nigam, L. H. Ungar, "Efficient clustering of high-dimensional data sets with application to reference matching," Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 169-178, DOI: 10.1145/347090.347123, .
[53] A. J. Bell, T. J. Sejnowski, "The “independent components” of natural scenes are edge filters," Vision Research, vol. 37 no. 23, pp. 3327-3338, DOI: 10.1016/s0042-6989(97)00121-1, 1997.
[54] A. C. Koivunen, A. B. Kostinski, "The feasibility of data whitening to improve performance of weather radar," Journal of Applied Meteorology, vol. 38 no. 6, pp. 741-749, DOI: 10.1175/1520-0450(1999)038<0741:tfodwt>2.0.co;2, 1999.
[55] C. Rashtchian, P. Young, M. Hodosh, J. Hockenmaier, "Collecting image annotations using amazon’s mechanical turk," Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pp. 139-147, .
[56] X. Zhai, Y. Peng, J. Xiao, "Learning cross-media joint representation with sparse and semisupervised regularization," IEEE Transactions on Circuits and Systems for Video Technology, vol. 24 no. 6, pp. 965-978, DOI: 10.1109/tcsvt.2013.2276704, 2014.
[57] T. Y. Lin, M. Maire, S. Belongie, "Microsoft coco: common objects in context," pp. 740-755, .
[58] A. Karpathy, L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128-3137, DOI: 10.1109/cvpr.2015.7298932, .
[59] F. Faghri, D. J. Fleet, J. R. Kiros, S. Fidler, "Vse++: improving visual-semantic embeddings with hard negatives," Proceedings of the British Machine Vision Conference (BMVC), .
[60] X. Zhai, Y. Peng, J. Xiao, "Effective heterogeneous similarity measure with nearest neighbors for cross-media retrieval," pp. 312-322, .
[61] K. Wang, R. He, L. Wang, W. Wang, T. Tan, "Joint feature selection and subspace learning for cross-modal retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38 no. 10, pp. 2010-2023, DOI: 10.1109/tpami.2015.2505311, 2016.
[62] X. Zhai, Y. Peng, J. Xiao, "Cross-modality correlation propagation for cross-media retrieval," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2337-2340, DOI: 10.1109/icassp.2012.6288383, .
[63] X. Zhai, Y. Peng, J. Xiao, "Heterogeneous metric learning with joint graph regularization for cross-media retrieval," Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1198-1204, .
[64] N. Rasiwasia, P. J. Moreno, N. Vasconcelos, "Bridging the gap: query by semantic example," IEEE Transactions on Multimedia, vol. 9 no. 5, pp. 923-938, DOI: 10.1109/tmm.2007.900138, 2007.
Copyright © 2020 Qibin Zheng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. http://creativecommons.org/licenses/by/4.0/
Abstract
Cross-modal retrieval aims to find relevant data of different modalities, such as images and text. In order to bridge the modality gap, most existing methods require many coupled sample pairs as training data. To reduce the demand for training data, we propose a cross-modal retrieval framework that utilizes both coupled and uncoupled samples. The framework consists of two parts: Abstraction, which aims to provide high-level single-modal representations with uncoupled samples, and Association, which links different modalities through a few coupled training samples. Moreover, under this framework, we implement a cross-modal retrieval method based on the consistency between the semantic structures of multiple modalities. First, both images and text are represented with the semantic structure-based representation, which describes each sample by its similarities to reference points generated from single-modal clustering. Then, the reference points of different modalities are aligned through an active learning strategy. Finally, the cross-modal similarity can be measured by the consistency between the semantic structures. The experimental results demonstrate that, given a proper abstraction of single-modal data, the relationship between different modalities can be simplified, and even limited coupled cross-modal training data are sufficient for satisfactory retrieval accuracy.