Abstract
Light field imaging has been widely acknowledged for its ability to capture both spatial and angular information of a scene, which can improve the performance of salient object detection (SOD) in complex environments. Existing approaches based on refocused images mainly explore the spatial features of different focus areas, while methods based on multi-view images are plagued by limitations such as data redundancy and high computational costs. In this study, we introduce a novel discrete viewpoint selection scheme to mitigate data redundancy. We also leverage the geometric characteristics of light field multi-view images to design a disparity extraction module that captures the disparity relationships between the selected viewpoints. Additionally, we construct a multi-feature fusion-feedback module that achieves mutual fusion of spatial, edge, and depth features for more accurate SOD. To validate our approach, we compare it with 12 existing methods on three datasets; the results show that our method strikes a balance between multi-view image redundancy and model performance. Our method accurately locates salient objects even in challenging scenarios such as multiple objects and complex backgrounds, thereby achieving high-precision SOD.
Introduction
Salient object detection (SOD) aims to rapidly and effectively extract the most striking objects or regions in a scene. Accurate SOD results can provide useful prior information for various computer vision tasks such as image segmentation, visual tracking, autonomous driving, and video compression [1–4].
Existing SOD methods can be roughly divided into three categories based on the type of input data: RGB images, RGB-D images, and images extracted from light field data. RGB images, as the most easily obtainable data type in daily life, were initially the focus of researchers in this field [5–11]. Traditional methods based on RGB images usually obtain saliency maps from low-level features such as color, shape, background, and texture [5, 6], and can achieve ideal results in simple scenes. However, they exhibit reduced effectiveness in challenging scenes characterized by factors such as object-background similarity, low contrast, complex backgrounds, and occlusion. Although deep learning-based methods can further obtain high-level features of images, the lack of depth information in RGB images limits their further development. In response to these problems, researchers have introduced RGB-D images, which contain depth information, into SOD tasks [12–19]. Accurate depth maps can provide clear scene depth and structural information for SOD. However, such methods impose stringent quality requirements on the depth maps they employ; the variable quality of depth maps obtained from depth cameras often prevents them from achieving optimal results. With the continuous development of imaging technology, light field imaging provides a new choice for SOD. Light fields record the intensity of light rays from different directions in the same scene and simultaneously provide multi-angle geometric information, offering richer support for obtaining high-quality saliency maps.
Fig. 1 [Images not available. See PDF.]
Two-plane parameterization representation of 4D light field
Modern handheld plenoptic cameras typically capture light field images by inserting a micro-lens array between the main lens and the sensor of a traditional camera [20]. In contrast to RGB images obtained with conventional cameras or depth maps derived from depth sensors, plenoptic cameras capture light field data that provides a more holistic and detailed representation of natural scenes, including crucial cues such as depth, focusness, and angular variation. As shown in Fig. 1, researchers typically use a two-plane parameterization L(x, y, u, v) to represent the light field, where (x, y) and (u, v) denote the spatial and angular coordinates, respectively [21]. Currently, two forms of light field data are mainly used for SOD: focal stack images and multi-view images. For the former, the light field data is transformed into images focused on different depth planes using computational imaging formulas [20]: a specific area of the scene is in focus, while areas away from the focal plane are blurred, and the features of multiple focused areas are integrated to perform SOD. Although methods using focal stack images can achieve ideal saliency maps [22, 23], most of them only fuse the features of the focal slices and the all-in-focus image without mining the relationships between different focused areas, which limits the use of the rich light field information. In addition, the local blur in different focal slices is not conducive to obtaining saliency maps with sharp edges, and it is difficult to obtain ideal detection results when the scene depth range is narrow. For the latter, as shown in Fig. 2, multi-view images record the scene at the same spatial location but from a different viewpoint under each micro-lens. Since each multi-view image is all-in-focus, it is easier to obtain saliency maps with sharp edges. However, if all viewpoints are used directly as input for SOD, there is a large amount of data redundancy and a tremendous computational cost, whereas if only a set of continuous viewpoints is selected as input, it is difficult to make full use of the disparity information.
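To make the two-plane representation and the two data forms more concrete, the following minimal NumPy sketch shows how a sub-aperture (multi-view) image and a simple shift-and-sum refocused slice can be obtained from a 4D light field array. The array layout, the 9×9 angular resolution, and the function names are illustrative assumptions, not the exact pipeline used in this paper.

```python
import numpy as np

# Illustrative 4D light field L(u, v, y, x, c): 9x9 angular samples of an
# H x W RGB scene. The array layout and sizes are assumptions; real data would
# come from a plenoptic camera rather than this placeholder array.
U, V, H, W = 9, 9, 400, 600
light_field = np.zeros((U, V, H, W, 3), dtype=np.float32)

def sub_aperture_view(lf, u, v):
    """One multi-view (sub-aperture) image: the all-in-focus view from (u, v)."""
    return lf[u, v]

def refocus(lf, alpha):
    """Toy shift-and-sum refocusing: shift each view in proportion to its
    angular offset from the center and average (integer shifts only)."""
    uc, vc = lf.shape[0] // 2, lf.shape[1] // 2
    acc = np.zeros_like(lf[0, 0])
    for u in range(lf.shape[0]):
        for v in range(lf.shape[1]):
            dy, dx = round(alpha * (v - vc)), round(alpha * (u - uc))
            acc += np.roll(lf[u, v], shift=(dy, dx), axis=(0, 1))
    return acc / (lf.shape[0] * lf.shape[1])

center_view = sub_aperture_view(light_field, U // 2, V // 2)   # multi-view form
focal_slice = refocus(light_field, alpha=1.0)                  # focal stack form
```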
Fig. 2 [Images not available. See PDF.]
Parallax from different viewpoints in multi-view images
To address these issues, we compare different viewpoint selection methods and present a novel discrete viewpoint selection scheme that achieves a large disparity range while using less light field data. Additionally, we propose an end-to-end network with a novel structure that fuses the spatial, edge, and depth information of the light field data to obtain more precise saliency maps.
To summarize, the contributions of this work are as follows:
We introduce a novel discrete viewpoint selection scheme based on the geometric features of multi-view images to obtain a large disparity range with less light field data, while ensuring unbiased viewpoint distribution.
We design a disparity extraction module (DEM) to reduce data redundancy and obtain effective disparity information by extracting common and differential features from viewpoints in different angular directions.
We propose an effective fusion-feedback module (FFM) to achieve more precise saliency maps for accurate SOD by integrating spatial, edge, and depth information of the light field data.
Related work
SOD based on 2D inputs
During the nascent phase of SOD research, scholars drew inspiration from the visual perception and neural architecture of primates [24] and predominantly utilized low-level, manually crafted features such as color and regional contrast [5–8]. However, it was difficult for these methods to achieve predictions highly consistent with human perception using only low-level features. With the continuous development of computer hardware, more and more deep learning-based SOD models have been proposed. Qin et al. [25] proposed a prediction–refinement architecture that focused on the boundary quality of saliency maps and achieved clear prediction of boundary structures through a residual refinement module and a hybrid loss. Wei et al. [26] addressed the problem of fusing features from different convolutional layers by proposing a cross-feature module that adaptively selects complementary information in the input features to aggregate multi-level features, effectively avoiding data redundancy. Liu et al. [27] investigated the role of pooling in SOD networks and proposed two pooling-based modules for refining and fusing semantic features from different levels, achieving effective sharpening of saliency maps. Qin et al. [28] nested two U-Net networks, effectively deepening the network at a small additional computational cost, thereby capturing more contextual details and extracting further multi-scale features. Wu et al. [29] proposed a stacked cross-refinement network that uses stacked cross-refinement units to refine two independent sets of multi-scale features for SOD and edge detection. Li et al. [30] proposed an end-to-end recurrent neural encoder for video SOD, enhancing the temporal correlation of per-frame features by capturing inter-frame changes and encoding the evolution of sequence features with an LSTM network. However, the lack of depth information in RGB images restricts the further development of SOD based on 2D inputs.
SOD based on 3D inputs
In order to supplement structural information in the scene, researchers have introduced RGB-D images with depth information into this field. Due to the varying quality of depth maps, extracting effective depth information to supplement saliency map information has become a research focus. Zhang et al. [12] designed a complimentary interaction module (CIM) to fuse global and local features in the scene. The CIM selectively extracts useful features from both RGB and depth data, and effectively integrates them into cross-modal features to generate saliency maps with clear edges. Fan et al. proposed a bifurcated backbone strategy (BBS) from the perspective of optimal fusion. The BBS divides multi-level features into teacher and student features, and then uses a depth-enhanced module to extract depth clues from channel and spatial dimensions to complement RGB and depth features [13]. Zhang et al. [14] proposed a generative model that maps input images and latent variables to random saliency predictions and updates latent variables by sampling from the true or approximate posterior distribution through an inference model. Unlike models that treat RGB and depth information as separate, Fu et al. [15] proposed an architecture for joint learning and tight collaborative fusion, with the former used for robust saliency feature learning and the latter for providing meaningful complementary features. However, the high-quality requirements for depth maps in SOD based on 3D inputs limit further development, as depth maps obtained by depth cameras have varying quality.
Table 1. Related work in the SOD area that investigates the best combinations of RGB images, depth, angle, boundary, feature fusion, and redundancy reduction
Paper | RGB | Depth | Angle | Boundary | Feature fusion | Reduce redundancy |
|---|---|---|---|---|---|---|
[5] | – | – | – | – | – | |
[25] | – | – | – | – | ||
[26] | – | – | – | – | ||
[12] | – | – | ||||
[13] | – | – | – | |||
[36] | – | – | – | – | ||
[37] | – | – | – | |||
[40] | – | – | ||||
Proposed |
Papers marked with a check indicate that they performed a comparison with the respective configuration
SOD based on 4D inputs
Early 4D light field SOD algorithms mainly relied on manually extracted superpixel-level features. Li et al. [31] combined color contrast maps with foreground priors and proposed the first SOD model based on light field information. Subsequently, Li et al. designed a general SOD model that could handle input data of heterogeneous types. They constructed a saliency dictionary based on the specific characteristics of the data, removed outliers from the dictionary, and obtained the final saliency map through iterative optimization of the dictionary [32]. Zhang et al. [33] used depth contrast to complement color features and improved the model’s performance by using background priors. In addition, Zhang et al. proposed combining multiple light field visual cues, including color, depth, focus points, and angles, to detect salient regions and introduced location priors to enhance the saliency map [34]. However, methods that rely on manually extracted features have made slow progress in improving the performance of SOD tasks due to the lack of heuristic semantic features. In recent years, an increasing number of deep learning-based light field SOD methods have been proposed, which can be roughly divided into two categories based on the type of the input data: methods based on focal stack images and methods based on multi-view images.
Among methods based on focal stack images, Wang et al. [35] proposed a pyramid attention structure that efficiently captures spatial relationship features between focal stack images by concentrating attention on salient regions of the network while leveraging multi-scale information. To address the computational and storage burden associated with high-dimensional light field data, Piao et al. [36] proposed an asymmetric dual-stream teacher–student network that transfers knowledge learned by a teacher network processing focal stack images to a student network processing RGB images based on knowledge distillation. Zhang et al. [22] proposed a novel memory decoder for light field SOD that explores the relationships between different focal stack images and designed a unique fusion mechanism (MO-SFM) to differentiate the influence of different focal stack images on the saliency map. To integrate focal stack and all-in-focus images, Wang et al. [37] proposed a dual-stream fusion framework where the focal stack stream adaptively learns image features and weighting factors using a recurrent attention mechanism and combines them with the output maps generated by the all-in-focus stream to predict salient regions.
In contrast to methods based on focal stack images, which primarily focus on the focused regions, methods based on multi-view images are more concerned with the correlation between space and angle. Piao et al. [38] proposed a single-view-based light field SOD algorithm that first estimates depth information from the central view, then synthesizes multilayer horizontal and vertical view images, and predicts the saliency map from the resulting multi-view saliency maps. Zhang et al. [39] introduced a module that models angular changes to capture the differences in angular information within the micro-lens image array, and the obtained features were fed into an improved Deeplab-v2 network to predict salient objects. Jing et al. [40] used epipolar plane images (EPIs) transformed from multi-view images as partial input to extract the depth and occlusion information of the scene and achieved more accurate SOD by combining these features with the spatial features of the central view. Zhang et al. [41] designed a multi-task learning network for light field SOD using spatial, edge, and depth clues, and modeled the disparity correlation between multi-view images using 3D convolution. However, using all 9×9 view images as input without filtering results in a large amount of redundant data and computational overhead, because the baseline between adjacent viewpoints is narrow. On the other hand, selecting only certain continuous horizontal or vertical view images as input makes it difficult to fully exploit the available disparity information [41]. Therefore, selecting appropriate multi-view images as input for better SOD remains a crucial issue worth studying.
Based on the analysis above, it is evident that providing more scene information in the model input, such as depth and angle information, can significantly enhance detection performance [12, 34]. Additionally, exploring boundary information of pixel abrupt regions in the scene is a highly effective means [12, 25]. Moreover, multi-feature fusion also enables the model to obtain comprehensive information [13, 22]. Finally, reducing data redundancy can alleviate hardware burdens such as memory and computational speed. In this context, we explore recent works investigating the optimal combination of SOD configurations. These papers compare at least one parameter: RGB images, depth, angle, boundary, feature fusion, and redundancy reduction. Table 1 presents the papers and configurations investigated in their study.
Methods
Viewpoint selection schemes
As shown in Fig. 2, light field data possesses extra angular resolution in both the horizontal and vertical directions compared with RGB images, which consequently leads to a larger volume of data. Although using all 9×9 viewpoints of the light field as input can provide more accurate spatial information, it increases the computational complexity. In light field depth estimation, to address data redundancy and high computational cost, usually only a set of continuous viewpoints along certain angular directions, as shown in Fig. 3a–c, is selected [42]. This continuous viewpoint selection scheme is beneficial for extracting more precise slope information when using EPIs for depth estimation. However, because the disparity between adjacent viewpoints is small, it can have adverse effects in SOD tasks. To address these issues, this study adopts a discrete viewpoint selection scheme, shown in Fig. 3d, which balances the total amount and the distribution bias of disparity while using fewer viewpoints.
Fig. 3 [Images not available. See PDF.]
Illustration of different viewpoint selection schemes
The light field can be represented as L(x, y, u, v), where (x, y) denotes the spatial coordinates and (u, v) the angular coordinates. Taking the center view as the reference, the relationship between the center view image and the other viewpoints can be expressed as:

$$L(x, y, u, v) = L\big(x + u\,d(x, y),\; y + v\,d(x, y),\; 0,\; 0\big) \tag{1}$$

where d(x, y) denotes the disparity between pixel (x, y) in the center view and the corresponding pixel in the adjacent viewpoint. As $\tan\theta = v/u$, for the angular direction $\theta$ the relationship can be reformulated as:

$$L(x, y, u, v) = L\big(x + u\,d(x, y),\; y + u\tan\theta\,d(x, y),\; 0,\; 0\big) \tag{2}$$

In general, the viewpoint index is an integer, and when $u\tan\theta$ is not an integer, there is no corresponding viewpoint available. Therefore, four angular directions of 0°, 45°, 90°, and 135° are selected for the disparity calculations. Assuming that the center view is at the origin, the disparity of the corresponding pixel in the view at angular offset (u, v) can be calculated using Eq. (3):

$$D(u, v) = \sqrt{u^{2} + v^{2}}\; d(x, y) \tag{3}$$
Table 2. Parallax values of different viewpoint selection schemes
Scheme | Number of viewpoints | Total disparity | Disparity per view |
|---|---|---|---|
a(one stream) | 9 | 20 | 2.22 |
b(two streams) | 17 | 40 | 2.35 |
c(four streams) | 33 | 107 | 3.24 |
d(discrete) | 13 | 50 | 3.85 |
Fig. 4 [Images not available. See PDF.]
The architecture overview of the proposed network. It is worth noting that the RGB image shown in the figure corresponds to the central view image in the MultiView images. The DEM module is enclosed by a dashed box, indicating that this module is exclusively used for the MultiView input stream
To simplify the calculations, we assume that the disparity between adjacent viewpoints (horizontal or vertical) is one unit. Then, we use the center viewpoint coordinate as the origin for calculating the disparity. Table 2 presents the total disparity and the average disparity for each of the four viewpoint selection schemes illustrated in Fig. 3.
Fig. 5 [Images not available. See PDF.]
Detailed structure of the DEM
Due to the narrow baseline between multi-view images, the absolute value of the total disparity has limited significance for extracting spatial features in SOD. On the other hand, over-reliance on the average disparity of a single view can lead to angular bias. Therefore, based on the statistical results in Table 2, the viewpoint selection scheme proposed in this study can achieve a balanced distribution of angles and amount of disparity with a relatively small number of viewpoints.
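As a concrete illustration of the accounting behind Table 2, the following sketch computes the total and per-view disparity of a viewpoint set under the unit-disparity assumption and Eq. (3). The 13-viewpoint coordinate list is only an illustrative guess at the discrete layout of Fig. 3d, not necessarily the authors' exact selection.

```python
import math

def disparity_stats(viewpoints):
    """viewpoints: (u, v) angular offsets from the center view, with one
    disparity unit between horizontally/vertically adjacent viewpoints."""
    total = sum(math.hypot(u, v) for (u, v) in viewpoints)  # Eq. (3) per view
    return total, total / len(viewpoints)

# A plausible 13-viewpoint discrete layout drawing on the 0°, 45°, 90°, and
# 135° directions; the actual layout of Fig. 3d may differ.
discrete = [(0, 0),
            (4, 0), (-4, 0), (0, 4), (0, -4),       # 0° and 90° directions
            (4, 4), (-4, -4), (4, -4), (-4, 4),     # 45° and 135°, outer ring
            (2, 2), (-2, -2), (2, -2), (-2, 2)]     # 45° and 135°, inner ring

total, per_view = disparity_stats(discrete)
print(f"views={len(discrete)}, total disparity={total:.1f}, per view={per_view:.2f}")
# Prints roughly 49.9 and 3.84, close to the scheme (d) row of Table 2; the same
# function reproduces the 20 / 2.22 figures of scheme (a) for a single 9-view row.
```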
Feature extraction
According to the analysis in references [43, 44], we use ResNet-50 [45] as the backbone network in this study; only the convolutional layers are kept and the fully connected layers are removed so that input images of different resolutions can be handled. For an input image of size H × W, the backbone produces features at five scales, where the spatial resolution of each scale is half that of the previous one, so the i-th feature map has size C_i × (H/2^i) × (W/2^i), with C_i denoting its number of channels. Since increasing the computational cost spent on the lowest-level features brings only a limited performance gain [46], only the features from the last four stages are used in our model, as shown in Fig. 4. The model takes three types of image data as input, namely the multi-view images, the RGB image, and the depth image, which are used to extract spatial features, edge features, and depth features, respectively, yielding three sets of features, denoted $F_{s}$, $F_{e}$, and $F_{d}$. Each set of features is compressed to 64 channels and sent to the decoder network for saliency mapping. Specifically, to properly integrate the disparity information of the multi-view images, a disparity extraction module (DEM) is added between the encoding and decoding networks at each stage of the multi-view stream.
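The following PyTorch sketch illustrates the kind of shared encoder described above: a ResNet-50 without its fully connected head, whose last four stages are each squeezed to 64 channels before being passed to the decoders. The class and attribute names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class Encoder(nn.Module):
    """ResNet-50 without its fully connected head; returns the four deeper
    stages, each squeezed to 64 channels for the decoders."""
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)  # pretrained weights would be loaded in practice
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        # 1x1 convolutions that compress each stage to 64 channels
        self.squeeze = nn.ModuleList([nn.Conv2d(c, 64, kernel_size=1)
                                      for c in (256, 512, 1024, 2048)])

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage, squeeze in zip(self.stages, self.squeeze):
            x = stage(x)
            feats.append(squeeze(x))
        return feats  # feature maps at 1/4, 1/8, 1/16, and 1/32 resolution

feats = Encoder()(torch.randn(1, 3, 256, 256))
print([tuple(f.shape) for f in feats])
```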
Disparity extraction module (DEM)
The angular dimension of light field data derives from the disparity between any two viewpoints. For the image array illustrated in Fig. 3d, there is only horizontal disparity between viewpoints in the same row and only vertical disparity between viewpoints in the same column. According to the analysis in [42], extracting the disparity along a single direction is more conducive to uncovering the relationships between viewpoints. Therefore, in order to extract disparity more effectively, we select 13 viewpoints from four angular directions (0°, 45°, 90°, and 135°) as input for SOD. As discussed in [47], when a network holds multiple forms of features for the same image, the common part of those features is more likely to correspond to the salient region. To this end, we design a disparity extraction module (DEM) consisting of three Sub-DEMs, whose structure is shown in Fig. 5. The overlapping parts of feature maps from different angular directions are used for the preliminary localization of salient objects, while the complementary information in the differing parts is used to improve the accuracy of SOD in complex scenes such as occlusion, high or low illumination, and small targets.
As shown in Fig. 3, the spatial features extracted from different viewpoints usually differ only slightly and contain redundant information. In order to extract the differences between these features, we add a DEM at all four stages of the encoder, and each DEM takes the features extracted from the images in the four angular directions (0°, 45°, 90°, and 135°) as input. The feature streams from the four angular directions are then paired according to their orthogonal relationship for further processing. Taking the orthogonal 0° and 90° feature streams as an example (as shown in Fig. 5), let the 0° feature stream be denoted as a and the 90° feature stream as b. The common feature between them is obtained using element-wise multiplication as an approximate logical AND operation:

$$F_{com} = a \otimes b \tag{4}$$

Then, the unique features of a and b are obtained with the following formulas:

$$F_{a} = a - \alpha F_{com} \tag{5}$$

$$F_{b} = b - \alpha F_{com} \tag{6}$$

Finally, the common and unique features of the two feature streams are merged to obtain the fusion feature:

$$F_{fuse} = F_{a} + F_{b} + \alpha F_{com} \tag{7}$$

Due to the increase in feature values caused by element-wise multiplication, a decay coefficient α = 0.4 is introduced. Similarly, the orthogonal fusion feature of the 45° and 135° streams can be obtained through another Sub-DEM. The two sets of fusion features are then fed into the third Sub-DEM to obtain the final feature. This process locates salient objects by computing the common features of the multi-view images and efficiently aggregates the differential information of the different viewpoints, which is equivalent to a small-scale dynamic observation of the objects in the scene and thus increases the dimensionality of the acquired information.
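A minimal PyTorch sketch of one Sub-DEM and the three-Sub-DEM arrangement is given below; it follows Eqs. (4)–(7) as reconstructed above, so the exact combination and the refinement convolution are assumptions rather than the authors' precise design.

```python
import torch
import torch.nn as nn

class SubDEM(nn.Module):
    """One Sub-DEM: fuse two orthogonal angular feature streams by separating
    their common part (element-wise multiplication as an approximate logical
    AND) from their stream-specific parts, following Eqs. (4)-(7) above."""
    def __init__(self, channels=64, alpha=0.4):
        super().__init__()
        self.alpha = alpha
        # light refinement of the fused feature (an assumption, not specified in the text)
        self.refine = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                    nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, a, b):
        common = a * b                       # Eq. (4): shared salient evidence
        unique_a = a - self.alpha * common   # Eq. (5): first-stream-specific cues
        unique_b = b - self.alpha * common   # Eq. (6): second-stream-specific cues
        fused = unique_a + unique_b + self.alpha * common  # Eq. (7)
        return self.refine(fused)

class DEM(nn.Module):
    """Three Sub-DEMs: the (0°, 90°) and (45°, 135°) pairs, then a final merge."""
    def __init__(self, channels=64):
        super().__init__()
        self.sub1, self.sub2, self.sub3 = SubDEM(channels), SubDEM(channels), SubDEM(channels)

    def forward(self, f0, f45, f90, f135):
        return self.sub3(self.sub1(f0, f90), self.sub2(f45, f135))

# Example: one encoder stage with 64-channel features from the four directions.
f = [torch.randn(2, 64, 32, 32) for _ in range(4)]
out = DEM()(f[0], f[1], f[2], f[3])
```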
Fusion-feedback module (FFM)
The input of the proposed network consists of three types of data: the RGB stream learns the edge features of salient objects, while the depth stream mainly mines the depth features of the different objects in the scene. The two streams complement and overlap with the final saliency map, so the FFM is designed to achieve feature interaction among the branches. As shown in Fig. 4, the FFM has a three-branch structure, where the multi-view and depth streams are supervised by the ground truth (GT), while the edge part of the GT supervises the RGB stream.
The overall structure of the network consists of three parts: the main encoding module (encoder), the decoding module (decoder), and the feedback encoding module (recoder). The encoder, implemented with ResNet-50, extracts the multimodal features $F_{s}$, $F_{e}$, and $F_{d}$, which are fed into the decoder to obtain rough saliency maps, edge maps, and depth maps, respectively. The three features produced by the decoder could be directly stacked along the channel dimension and fused by a convolutional layer to compute the final saliency map. However, such direct fusion leads to three problems: (1) the depth feature contains abundant depth information, including that of the background and non-salient objects, which causes interference; (2) the saliency map obtained by direct fusion usually has low boundary quality and discontinuous predicted region probabilities [47]; and (3) direct fusion cannot fully explore the relationships between the features extracted by each branch.
Therefore, the FFM is designed to address these issues. First, a logical AND operation is taken between the depth features $F_{d}$ and the edge features $F_{e}$ and spatial features $F_{s}$, respectively, to exclude the interference caused by useless depth information. The resulting intersected features are then concatenated with the edge and spatial features for the final saliency prediction:

$$F_{fuse} = cat\big(F_{d} \otimes F_{e},\; F_{d} \otimes F_{s},\; F_{e},\; F_{s}\big) \tag{8}$$

where $F_{fuse}$ denotes the fusion feature, $\otimes$ represents element-wise multiplication, and cat represents channel-wise concatenation. In addition, inspired by BASNet, which uses a prediction–refinement structure to further refine the saliency prediction, the proposed FFM takes the above fusion feature as input, generates features of different scales through a simple feedback encoding network, and then feeds them into two convolutional blocks that reduce the channel dimension to 64 and feed back, respectively, to the preceding spatial decoder and edge decoder. At each scale, the decoder connects the fusion feature from the feedback encoder with the feature from the main encoder to realize the interaction of different modal information, thereby generating more accurate saliency maps.
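The sketch below illustrates the fusion step of Eq. (8) as reconstructed above together with a simple feedback encoder; the layer sizes, the depth-gating order, and the module names are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class FFM(nn.Module):
    """Fusion step of Eq. (8) plus a simple feedback encoder: depth features
    gate the edge and spatial features, the gated and original features are
    concatenated and fused, and two 64-channel feedback features are returned
    for the spatial and edge decoders."""
    def __init__(self, channels=64):
        super().__init__()
        self.fuse = nn.Conv2d(4 * channels, channels, kernel_size=3, padding=1)
        self.feedback = nn.Sequential(   # toy feedback encoding network
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.to_spatial = nn.Conv2d(channels, 64, kernel_size=3, padding=1)
        self.to_edge = nn.Conv2d(channels, 64, kernel_size=3, padding=1)

    def forward(self, f_spatial, f_edge, f_depth):
        gated_edge = f_depth * f_edge        # keep depth cues that agree with edges
        gated_spatial = f_depth * f_spatial  # keep depth cues that agree with saliency
        fused = self.fuse(torch.cat([gated_edge, gated_spatial, f_edge, f_spatial], dim=1))
        fb = self.feedback(fused)
        return fused, self.to_spatial(fb), self.to_edge(fb)
```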
Loss function
As illustrated in Fig. 4, the proposed network is trained end-to-end with different modalities of saliency maps. Therefore, the loss function is defined as a combination of four losses:

$$L = L_{s} + L_{e} + L_{d} + L_{f} \tag{9}$$

where $L_{s}$, $L_{e}$, $L_{d}$, and $L_{f}$ denote the loss functions for the spatial, edge, depth, and fusion features, respectively. For the spatial, edge, and depth branches, the binary cross-entropy (BCE) loss, the most widely used loss in binary classification, is employed, as shown in Eq. (10):

$$L_{BCE} = -\sum_{(x, y)}\big[g(x, y)\log p(x, y) + (1 - g(x, y))\log(1 - p(x, y))\big] \tag{10}$$

where g(x, y) represents the ground-truth label of pixel (x, y), with a value range of [0, 1], and p(x, y) denotes the predicted probability, also with a value range of [0, 1]. However, since the foreground and background have equal weight in the BCE loss, this loss function mainly focuses on pixel-level differences in the saliency map without considering the influence of the ground-truth values of adjacent pixels. Therefore, the IoU loss [48] is introduced in Eq. (11) to focus on the overall structural information:

$$L_{IoU} = 1 - IoU \tag{11}$$

where $IoU$ is defined as:

$$IoU = \frac{\sum_{x, y} p(x, y)\, g(x, y)}{\sum_{x, y}\big[p(x, y) + g(x, y) - p(x, y)\, g(x, y)\big]} \tag{12}$$
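As a hedged illustration of this objective, the sketch below combines per-branch BCE losses with the IoU term of Eqs. (11)–(12) on the fused prediction; the equal weighting of the four terms and the choice of which branch receives the IoU term are assumptions.

```python
import torch
import torch.nn.functional as F

def iou_loss(pred, target, eps=1e-7):
    """Structure-aware IoU loss of Eqs. (11)-(12); pred and target are
    probability maps in [0, 1] with shape (B, 1, H, W)."""
    inter = (pred * target).sum(dim=(2, 3))
    union = (pred + target - pred * target).sum(dim=(2, 3))
    return (1 - inter / (union + eps)).mean()

def total_loss(spatial_logits, edge_logits, depth_logits, fusion_logits, gt, gt_edge):
    """Combined objective in the spirit of Eq. (9): BCE on the spatial, edge,
    and depth branches; BCE plus IoU on the fused prediction."""
    bce = F.binary_cross_entropy_with_logits
    l_spatial = bce(spatial_logits, gt)          # multi-view branch vs. GT saliency
    l_edge = bce(edge_logits, gt_edge)           # RGB branch vs. GT edges
    l_depth = bce(depth_logits, gt)              # depth branch vs. GT saliency
    fusion_prob = torch.sigmoid(fusion_logits)
    l_fusion = bce(fusion_logits, gt) + iou_loss(fusion_prob, gt)
    return l_spatial + l_edge + l_depth + l_fusion
```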
Experiments
Datasets
There are currently six open-source light field datasets available: LFSD [31], HFUT-Lytro [34], Lytro-Illum [39], DUTLF-FS [22], DUTLF-MV [38], and DUTLF-V2 [49]. LFSD and HFUT-Lytro contain only 100 and 255 scenes, respectively, with small scales, low resolutions, limited scene categories, and no division into training and testing sets. DUTLF-FS is primarily used for methods based on light field refocusing data. Lytro-Illum comprises 640 scenes with a resolution of , while DUTLF-MV contains 1580 scenes with a resolution of , and DUTLF-V2 includes 4204 scenes with a resolution of . These three datasets are relatively large in scale, high in resolution, and comprehensive in scene variety, making them widely adopted by researchers [39, 40–41, 50, 51].
Evaluation metrics
We conduct quantitative analysis on the proposed model using four commonly used metrics for SOD.
(1) The F-measure calculates the weighted harmonic mean of Precision and Recall using a nonnegative weight β², defined as:

$$F_{\beta} = \frac{(1 + \beta^{2})\, Precision \times Recall}{\beta^{2}\, Precision + Recall} \tag{13}$$

where β² represents the weight between Precision and Recall. According to [52], β² is commonly set to 0.3 to emphasize Precision. Different forms of the F-measure can be obtained by varying how Precision and Recall are computed.

(2) The S-measure is typically used to assess the spatial structure similarity between predicted saliency maps and labels [53]:

$$S_{\alpha} = \alpha S_{o} + (1 - \alpha) S_{r} \tag{14}$$

where $S_{o}$ and $S_{r}$ represent the object-aware and region-aware structural similarity, respectively, and α is the balance coefficient between $S_{o}$ and $S_{r}$, set to 0.5 following [53].

(3) The E-measure considers both the local pixel similarity between predicted saliency maps and labels and the global pixel statistics [54]:

$$E_{\xi} = \frac{1}{W \times H}\sum_{i = 1}^{W}\sum_{j = 1}^{H} \xi(i, j) \tag{15}$$

where ξ is the consistency (enhanced alignment) gain matrix, W and H represent the width and height of the label, and i and j represent pixel indices.

(4) MAE is an overlapping evaluation metric that describes the probability of salient pixels being incorrectly assigned as non-salient:

$$MAE = \frac{1}{W \times H}\sum_{i = 1}^{W}\sum_{j = 1}^{H} \big|P(i, j) - G(i, j)\big| \tag{16}$$

where W and H represent the width and height of the label, i and j represent pixel indices, and P and G represent the predicted saliency map and the label, respectively.
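For reference, a minimal NumPy sketch of the two simpler metrics (MAE and a single-threshold F-measure) is given below; the S-measure and E-measure involve the structural and enhanced-alignment terms of [53, 54] and are omitted here, and the fixed threshold is an illustrative choice.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error (Eq. (16)); pred and gt are maps normalized to [0, 1]."""
    return np.abs(pred - gt).mean()

def f_measure(pred, gt, beta2=0.3, threshold=0.5):
    """F-measure (Eq. (13)) at a single binarization threshold; benchmarks
    typically sweep thresholds and report a maximum or adaptive value."""
    binary = pred >= threshold
    gt_mask = gt >= 0.5
    tp = np.logical_and(binary, gt_mask).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / (gt_mask.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```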
Table 3. Quantitative comparisons with other methods. 'Ours' and 'Ours*' respectively denote the models proposed in this paper using ResNet-50 and VGG-16 as backbone networks

Models | DUTLF-V2 | DUTLF-MV | Lytro-Illum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
↑ | ↑ | ↑ | MAE ↓ | ↑ | ↑ | ↑ | MAE ↓ | ↑ | ↑ | ↑ | MAE ↓ |
Light Field | Ours | 0.9188 | 0.9037 | 0.9503 | 0.0275 | 0.9069 | 0.9159 | 0.9465 | 0.0362 | 0.9082 | 0.8851 | 0.9351 | 0.0298 |
Ours* | 0.8900 | 0.8912 | 0.9309 | 0.0366 | 0.8819 | 0.8892 | 0.9286 | 0.0459 | 0.8857 | 0.8573 | 0.9127 | 0.0402 | |
OBGNet | 0.8961 | 0.8719 | 0.9287 | 0.0368 | 0.9440 | 0.9401 | 0.9611 | 0.0223 | 0.8936 | 0.8700 | 0.9236 | 0.0405 | |
ERNet | 0.8296 | 0.8069 | 0.8949 | 0.0582 | 0.8638 | 0.8810 | 0.9111 | 0.0534 | 0.8168 | 0.8070 | 0.8818 | 0.0649 | |
MoLF | 0.8141 | 0.7548 | 0.8710 | 0.0732 | 0.8633 | 0.8468 | 0.9043 | 0.0630 | 0.8068 | 0.7561 | 0.8629 | 0.0783 | |
RGB-D | SSF | 0.8282 | 0.7826 | 0.8667 | 0.0621 | 0.8726 | 0.8680 | 0.9070 | 0.0524 | 0.8287 | 0.7734 | 0.8651 | 0.0605 |
BBS | 0.9092 | 0.8952 | 0.9440 | 0.0290 | 0.8970 | 0.9025 | 0.9352 | 0.0404 | 0.9001 | 0.8649 | 0.9194 | 0.0372 | |
JLDCF | 0.9124 | 0.8865 | 0.9379 | 0.0307 | 0.8875 | 0.9033 | 0.9406 | 0.0418 | 0.9068 | 0.8791 | 0.9318 | 0.0312 | |
UCNet | 0.9007 | 0.8841 | 0.9400 | 0.0325 | 0.8822 | 0.8959 | 0.9318 | 0.0445 | 0.7980 | 0.7368 | 0.8477 | 0.0690 | |
90RGB | BASNet | 0.7800 | 0.7235 | 0.8495 | 0.0887 | 0.8938 | 0.8742 | 0.9227 | 0.0507 | 0.8336 | 0.8025 | 0.8847 | 0.0647 |
U2Net | 0.8810 | 0.8594 | 0.9168 | 0.0408 | 0.8890 | 0.8980 | 0.9362 | 0.0420 | 0.8848 | 0.8572 | 0.9137 | 0.0396 | |
EGNet | 0.9076 | 0.8789 | 0.9278 | 0.0343 | 0.8932 | 0.8938 | 0.9362 | 0.0419 | 0.8902 | 0.8478 | 0.9111 | 0.0395 | |
CPD | 0.8874 | 0.8541 | 0.9198 | 0.0409 | 0.8920 | 0.8963 | 0.9346 | 0.0429 | 0.8857 | 0.8452 | 0.9110 | 0.0410 | |
F3Net | 0.8889 | 0.8574 | 0.9231 | 0.0409 | 0.8826 | 0.8897 | 0.9337 | 0.0405 | 0.8894 | 0.8520 | 0.9190 | 0.0413 | |
The bold, italic, and bold italic entries represent the best, second-best, and third-ranked results, respectively
Fig. 6 [Images not available. See PDF.]
PR curves on three datasets
Implementation details
The proposed model is trained on the training set of DUTLF-V2 and evaluated on its testing set and on the DUTLF-MV and Lytro-Illum datasets. To prevent overfitting, we augment the input images using horizontal flipping, random cropping, and random multi-scale operations. Furthermore, we initialize the parameters of the backbone network with pre-trained ResNet-50 (referred to as 'Ours' in Table 3) and VGG-16 (referred to as 'Ours*' in Table 3), respectively, while randomly initializing all other parameters. The experiments are conducted using PyTorch 1.7.1 on a GeForce RTX 3090 Ti GPU. We adopt stochastic gradient descent (SGD) as the optimization algorithm, with a maximum learning rate of 0.002 for the ResNet-50 backbone and 0.02 for all other parts. The momentum and weight decay coefficients are set to 0.9 and 0.0005, respectively, and the batch size is set to 2. The model converges after 60 iterations.
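The optimizer configuration described above can be expressed with two parameter groups, as in the sketch below; the `backbone` and `head` modules are stand-ins for the actual pretrained encoder and the remaining randomly initialized layers.

```python
import torch
import torch.nn as nn

# Stand-in modules: `backbone` represents the pretrained ResNet-50 encoder and
# `head` the randomly initialized remainder of the network.
backbone = nn.Conv2d(3, 64, 3)
head = nn.Conv2d(64, 1, 1)

optimizer = torch.optim.SGD(
    [{"params": backbone.parameters(), "lr": 0.002},   # smaller LR for pretrained layers
     {"params": head.parameters(), "lr": 0.02}],       # larger LR for new layers
    momentum=0.9, weight_decay=0.0005)
```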
Comparison with other methods
In this section, we compare our model with 12 learning-based SOD models, including three light field SOD models (OBGNet [40], ERNet [36], and MoLF [22]), four RGB-D SOD models (SSF [12], BBS [13], JLDCF [15], and UCNet [14]), and five RGB SOD models (BASNet [25], U2Net [28], EGNet [55], CPD [56], and F3Net [26]). To ensure fairness, all of the above models are retrained on the training set of the DUTLF-V2 dataset.
Fig. 7 [Images not available. See PDF.]
Qualitative comparisons between other methods
Quantitative evaluation
We use four evaluation metrics for the quantitative analysis: MAE and the F-measure reflect the accuracy of the results, while the E-measure and S-measure reflect the completeness of the salient structures. As shown in Table 3 (with the top three performers highlighted in bold, italic, and bold italic), the proposed model achieves the best results on all evaluation metrics on the largest light field dataset, DUTLF-V2, outperforming the second-best method, OBGNet, on the F-measure, E-measure, S-measure, and MAE. On the DUTLF-MV dataset, the proposed model obtains the second-best results, only slightly behind OBGNet, while using only 1/6 of OBGNet's data. On the Lytro-Illum dataset, the proposed model also achieves the best results on all evaluation metrics. Additionally, it can be observed that using ResNet-50 yields better results than VGG-16; this is attributed to the deeper architecture of the former and its effective mitigation of the degradation problem in deep networks through skip connections. As shown by the PR curves in Fig. 6, the proposed model obtains the best results on both the DUTLF-V2 and Lytro-Illum datasets.
Qualitative comparison
The 10 images used for the qualitative comparison are selected from different datasets and include various challenging scenes. As illustrated in Fig. 7, these include multiple and small objects (3rd row), high and low lighting conditions (4th and 10th rows), a salient object similar to the background (6th row), similar object interference in the background (8th row), and cluttered scenes with various distractors (1st, 5th, and 9th rows). The proposed model exhibits clear advantages in salient object localization, segmentation, and edge detail in all of these challenging scenes.
Ablation study
To validate the effectiveness of the modules and data input schemes in the proposed model, ablation experiments are carried out on the DUTLF-V2 dataset.
The effectiveness of the viewpoint selection scheme
Previous studies [38–41] have shown that the angular information in multi-view images provides more accurate spatial localization in SOD tasks, thereby improving detection accuracy in some challenging scenes.
However, it is important to find a balance between image data redundancy and model performance. In this study, we compare the four different viewpoint selection schemes described in the viewpoint selection subsection. It is worth noting that scheme (a) has only one angular direction and cannot use the proposed DEM; it is therefore processed in the same way as the RGB image and is connected directly to the corresponding decoder after encoding. Scheme (b) has only horizontal and vertical angular directions and requires only one Sub-DEM. Furthermore, in addition to the input schemes listed above, we also test a concentric-circle input scheme, whose viewpoint selection is illustrated in Fig. 8 (this scheme also incorporates the DEM). As shown in Table 4, our scheme has a clear advantage over schemes (a) and (b). Although scheme (c) uses more viewpoints, according to the analysis in reference [50], most of the selected viewpoints are close to the central viewpoint, resulting in data redundancy, angular bias, and worse performance. Compared with the proposed input scheme, the concentric-circle scheme places more emphasis on horizontal and vertical views while downplaying diagonal views; according to the analysis in reference [50], corner views differ more distinctly from the central view, and the information they provide is crucial for the saliency detection task.
The effectiveness of the multi-stream structure
To validate the effectiveness of the proposed multi-stream structure in this study, we conduct ablation experiments on different network streams, and the comparative results are presented in Table 5. In this table, RGB represents a single-stream network when only the central viewpoint is used as input. In this case, the network structure consists of a simple encoding–decoding structure. RGB+MultiView indicates inputting discrete multi-view images of the light field, incorporating the DEM module and fusion-feedback module designed for light field structures, resulting in a significant improvement in performance. RGB+MultiView+Depth adds the depth stream, representing the complete network structure employed in this study.
Fig. 8 [Images not available. See PDF.]
Viewpoint selection method for concentric-circle input scheme
The effectiveness of the DEM
The baseline network of this study has an encoding–decoding structure similar to U-Net [57]. Without the DEM, the features of multiple viewpoints from the same angular direction are directly stacked along the channel dimension and fused to compute the final saliency map. As shown in Table 6, the DEM significantly improves the accuracy of the saliency prediction results.
Table 4. Comparisons between different viewpoint selection schemes
Methods | DUTLF-V2 | |||
|---|---|---|---|---|
↑ | ↑ | ↑ | MAE ↓ | |
a | 0.8719 | 0.8563 | 0.9169 | 0.0445 |
b | 0.8915 | 0.8713 | 0.9280 | 0.0392 |
c | 0.9062 | 0.8853 | 0.9396 | 0.0335 |
concentric-circle | 0.9079 | 0.8857 | 0.9388 | 0.0330 |
Ours | 0.9188 | 0.9037 | 0.9503 | 0.0275 |
Bold values indicate the best result
Table 5. Experimental results on the effectiveness of the multi-stream structure
Methods | DUTLF-V2 | |||
|---|---|---|---|---|
↑ | ↑ | ↑ | MAE ↓ | |
RGB | 0.8553 | 0.8243 | 0.8759 | 0.0625 |
RGB+MultiView | 0.9074 | 0.8837 | 0.9378 | 0.0319 |
RGB+MultiView+Depth | 0.9188 | 0.9037 | 0.9503 | 0.0275 |
Bold values indicate the best result
Table 6. Experimental results on the effectiveness of the DEM
Methods | DUTLF-V2 | |||
|---|---|---|---|---|
↑ | ↑ | ↑ | MAE ↓ | |
Baseline | 0.8794 | 0.8432 | 0.9195 | 0.0437 |
Baseline+FFM | 0.9050 | 0.8819 | 0.9393 | 0.0331 |
Baseline+FFM+DEM | 0.9188 | 0.9037 | 0.9503 | 0.0275 |
Bold values indicate the best result
Table 7. Experimental results on the effectiveness of the FFM
Methods | DUTLF-V2 | |||
|---|---|---|---|---|
↑ | ↑ | ↑ | MAE ↓ | |
Baseline | 0.8794 | 0.8432 | 0.9195 | 0.0437 |
Baseline+DEM | 0.8920 | 0.8581 | 0.9299 | 0.0381 |
Baseline+DEM+Refine | 0.9034 | 0.8757 | 0.9394 | 0.0336 |
Baseline+FFM+DEM | 0.9188 | 0.9037 | 0.9503 | 0.0275 |
Bold values indicate the best result
Table 8. Experimental results on different combinations of geometric transformation methods
Scheme | DUTLF-V2 | |||
|---|---|---|---|---|
↑ | ↑ | ↑ | MAE ↓ | |
horizontal flipping + random cropping + random multi-scale | 0.9188 | 0.9037 | 0.9503 | 0.0275 |
horizontal flipping + translation + rotation | 0.9159 | 0.8930 | 0.9416 | 0.0290 |
translation + random cropping + random multi-scale | 0.9126 | 0.8926 | 0.9329 | 0.0305 |
translation + random cropping + rotation | 0.9091 | 0.8798 | 0.9227 | 0.0347 |
no data augmentation | 0.8937 | 0.8559 | 0.9231 | 0.0410 |
Bold values indicate the best result
The effectiveness of the FFM
We compare four different network architectures, as shown in Table 7: the baseline network structure, identical to the one used in Table 6; Baseline+DEM, a structure that directly fuses the spatial, edge, and depth features after the baseline network and removes the FFM; Baseline+DEM+Refine, a structure that inputs the spatial, edge, and depth features into a Refine module [25] after fusion; and Baseline+FFM+DEM, the final network model proposed in this study. Experimental results indicate that the Baseline+FFM+DEM structure outperforms the others. Although the Baseline+DEM structure fuses all of the salient features, the fused features are relatively coarse and lack further refinement. On the other hand, although Baseline+DEM+Refine refines the fusion features through the encoding–decoding process of the Refine module, it lacks information interaction between the different features and further deepens the network, making it more difficult to optimize.
Data augmentation methods
Common geometric transformation methods include flipping, cropping, rotation, translation, and random multi-scale operations. We select three of these at a time for data augmentation, as outlined in Table 8, and test the combinations on the DUTLF-V2 dataset. The experimental results demonstrate that the combination of horizontal flipping, random cropping, and random multi-scale operations yields the best performance for the proposed method.
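A minimal sketch of how such a combination could be applied is shown below. The key point is that one shared geometric transform is applied to every multi-view image, the depth map, and the label; a full light field pipeline might additionally reorder the views after a flip to preserve the disparity directions, a detail omitted here. The scale set, crop ratio, and function names are illustrative assumptions, not the authors' implementation.

```python
import random
import torchvision.transforms.functional as TF

def augment(views, depth, gt, scales=(0.75, 1.0, 1.25), crop_ratio=0.875):
    """Apply one shared horizontal flip, multi-scale resize, and random crop to
    every multi-view image tensor, the depth map, and the label.
    Parameter values are illustrative."""
    tensors = list(views) + [depth, gt]
    if random.random() < 0.5:                                   # horizontal flipping
        tensors = [TF.hflip(t) for t in tensors]
    s = random.choice(scales)                                   # random multi-scale
    h, w = tensors[0].shape[-2:]
    tensors = [TF.resize(t, [int(h * s), int(w * s)]) for t in tensors]
    h, w = tensors[0].shape[-2:]
    ch, cw = int(h * crop_ratio), int(w * crop_ratio)           # random cropping
    top, left = random.randint(0, h - ch), random.randint(0, w - cw)
    tensors = [TF.crop(t, top, left, ch, cw) for t in tensors]
    return tensors[:-2], tensors[-2], tensors[-1]
```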
Conclusions
This paper presents a novel approach for light field SOD based on discrete viewpoint selection and multi-feature fusion. First, we design a novel viewpoint selection scheme based on the geometric features of multi-view images, which allows a large disparity range to be obtained with less light field data while ensuring an unbiased distribution of viewpoints. Second, we propose a disparity extraction module (DEM) to reduce data redundancy and obtain effective disparity information, which accurately locates salient objects by extracting common and distinctive features from different viewpoints. Finally, we design a fusion-feedback module (FFM) to fuse spatial, edge, and depth features, which further refines the saliency prediction through complementary interactions between the different features. Compared with other methods, our model achieves the best salient object detection results on the DUTLF-V2 and Lytro-Illum datasets and the second-best results on the DUTLF-MV dataset, demonstrating its effectiveness in challenging scenarios with multiple objects and complex backgrounds. Nevertheless, there is still room for improvement. In future work, we plan to consider the spatial relationships of multi-view images and reduce light field data redundancy by constructing the 3D structure of the scene from grayscale images to obtain more meaningful features.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (62171178, 61801161, and 61971177), the Natural Science Foundation of Anhui Province (1908085QF282) and the Fundamental Research Funds for the Central Universities (JZ2020HGTB0048).
Data Availability
The datasets used or analyzed during the current study are available from the corresponding author on reasonable request.
Declarations
Conflict of interest
The authors listed in this article declare that they have no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Wang, W., Shen, J., Porikli, F.: Saliency-aware geodesic video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3395–3402 (2015)
2. Hong, S., You, T., Kwak, S., Han, B.: Online tracking by learning discriminative saliency map with convolutional neural network. In: International Conference on Machine Learning, pp. 597–606. PMLR (2015)
3. Zhang, Z., Fidler, S., Urtasun, R.: Instance-level segmentation for autonomous driving with deep densely connected mrfs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 669–677 (2016)
4. Liu, G., Fan, D.: A model of visual attention for natural image retrieval. In: 2013 International Conference on Information Science and Cloud Computing Companion, pp. 728–733. IEEE (2013)
5. Wang, Z; Xiaobei, W. Salient object detection using biogeography-based optimization to combine features. Appl. Intell.; 2016; 45, pp. 1-17. [DOI: https://dx.doi.org/10.1007/s10489-015-0739-x]
6. Cheng, Ming-Ming; Mitra, Niloy J; Huang, Xiaolei; Torr, Philip HS; Shi-Min, Hu. Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach. Intell.; 2014; 37,
7. Yao, Q., Huchuan, L., Yiqun, X., He, W.: Saliency detection via cellular automata. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 110–119 (2015)
8. Arya, Rinki; Singh, Navjot; Agrawal, RK. A novel combination of second-order statistical features and segmentation using multi-layer superpixels for salient object detection. Appl. Intell.; 2017; 46, pp. 254-271. [DOI: https://dx.doi.org/10.1007/s10489-016-0819-6]
9. Hongshuang Zhang, Yu; Zeng, Huchuan Lu; Zhang, Lihe; Li, Jianhua; Qi, Jinqing. Learning to detect salient object with multi-source weak supervision. IEEE Trans. Pattern Anal. Mach. Intell.; 2021; 44,
10. Cheng, Ming-Ming; Gao, Shang-Hua; Borji, Ali; Tan, Yong-Qiang; Lin, Zheng; Wang, Meng. A highly efficient model to study the semantics of salient object detection. IEEE Trans. Pattern Anal. Mach. Intell.; 2021; 44,
11. Lv, T., Bo, L., Yijie, Z., Shouhong, D., Mofei, S.: Disentangled high quality salient object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3580–3590 (2021)
12. Miao, Z., Weisong, R., Yongri, P., Zhengkun, R., Huchuan, L.: Select, supplement and focus for rgb-d saliency detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3472–3481 (2020)
13. Deng-Ping, F., Yingjie, Z., Ali, B., Jufeng, Y, Ling, S.: Bbs-net: Rgb-d salient object detection with a bifurcated backbone strategy network. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII, pp. 275–292. Springer (2020)
14. Zhang, Jing; Fan, Deng-Ping; Dai, Yuchao; Anwar, Saeed; Saleh, Fatemeh; Aliakbarian, Sadegh; Barnes, Nick. Uncertainty inspired rgb-d saliency detection. IEEE Trans. Pattern Anal. Mach. Intell.; 2021; 44,
15. Fu, K., Fan, D-P., Ji, G-P., Zhao, Q.: Jl-dcf: joint learning and densely-cooperative fusion framework for rgb-d salient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3052–3062 (2020)
16. Guanyu, Z., Longsheng, W., Siyuan, G., Yongtao, W.: A cascaded refined rgb-d salient object detection network based on the attention mechanism. Appl. Intell. 1–22 (2022)
17. Keren, Fu; Fan, Deng-Ping; Ji, Ge-Peng; Zhao, Qijun; Shen, Jianbing; Zhu, Ce. Siamese network for rgb-d salient object detection and beyond. IEEE Trans. Pattern Anal. Mach. Intell.; 2021; 44,
18. Fan, Deng-Ping; Lin, Zheng; Zhang, Zhao; Zhu, Menglong; Cheng, Ming-Ming. Rethinking rgb-d salient object detection: models, data sets, and large-scale benchmarks. IEEE Trans. Neural Netw. Learn. Syst.; 2020; 32,
19. Gao, Wei; Liao, Guibiao; Ma, Siwei; Li, Ge; Liang, Yongsheng; Lin, Weisi. Unified information fusion network for multi-modal rgb-d and rgb-t salient object detection. IEEE Trans. Circuits Syst. Video Technol.; 2021; 32,
20. Ng, R., Levoy, M., Brédif, M., Duval, G., Horowitz, M., Hanrahan, P.: Light field photography with a hand-held plenoptic camera. Ph.D. thesis, Stanford University (2005)
21. Marc, L., Pat, H.: Light field rendering. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 31–42 (1996)
22. Zhang, M., Li, J., Wei, J., Piao, J., Lu, H.: Memory-oriented decoder for light field salient object detection. Advances in Neural Information Processing Systems, 32, 898–908
23. Zhang, Miao; Ji, Wei; Piao, Yongri; Jingjing Li, Yu; Zhang, Shuang Xu; Huchuan, Lu. Lfnet: light field fusion network for salient object detection. IEEE Trans. Image Process.; 2020; 29, pp. 6276-6287. [DOI: https://dx.doi.org/10.1109/TIP.2020.2990341]
24. Itti, Laurent; Koch, Christof; Niebur, Ernst. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell.; 1998; 20,
25. Xuebin, Q., Zichen, Z., Chenyang, H., Chao, G., Masood, D., Martin, J.: Basnet: boundary-aware salient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7479–7489 (2019)
26. Wei, J., Wang, S., Huang, Q.: F3Net: fusion, feedback and focus for salient object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, 34, pp. 12321–12328 (2020)
27. Liu, J-J., Hou, Q., Cheng, M-M., Feng, J., Jiang, J.: A simple pooling-based design for real-time salient object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3917–3926 (2019)
28. Qin, Xuebin; Zhang, Zichen; Huang, Chenyang; Dehghan, Masood; Zaiane, Osmar R; Jagersand, Martin. U2-net: going deeper with nested u-structure for salient object detection. Pattern Recognit.; 2020; 106, [DOI: https://dx.doi.org/10.1016/j.patcog.2020.107404] 107404.
29. Wu, Z., Su, L., Huang, Q.: Stacked cross refinement network for edge-aware salient object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7264–7273 (2019)
30. Li, G., Xie, Y., Wei, T., Wang, K., Lin, L.: Flow guided recurrent neural encoder for video salient object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3243–3252 (2018)
31. Li, N., Ye, J., Ji, Y., Ling, H., Yu, J.: Saliency detection on light field. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2806–2813 (2014)
32. Li, N., Sun, B., Yu, J.: A weighted sparse coding framework for saliency detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5216–5223 (2015)
33. Zhang, J., Wang, M., Gao, J., Wang, Y., Zhang, X., Wu, X.: Saliency detection with a deeper investigation of light field. In: IJCAI, pp. 2212–2218 (2015)
34. Zhang, J; Wang, M; Lin, L; Yang, X; Gao, J; Rui, Y. Saliency detection on light field: a multi-cue approach. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM); 2017; 13,
35. Wang, W., Zhao, S., Shen, J., CH Hoi, S., Borji, A.: Salient object detection with pyramid attention and salient edges. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1448–1457 (2019)
36. Piao, Y., Rong, Z., Zhang, M., Lu, H.: Exploit and replace: an asymmetrical two-stream architecture for versatile light field saliency detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, 34, pp. 11865–11873 (2020)
37. Wang, T., Piao, Y., Li, X., Zhang, L., Lu, H.: Deep learning for light field saliency detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8838–8848 (2019)
38. Piao, Y., Rong, Z., Zhang, M., Li, X., Lu, H.: Deep light-field-driven saliency detection from a single view. In: IJCAI, pp. 904–911 (2019)
39. Zhang, Jun; Liu, Yamei; Zhang, Shengping; Poppe, Ronald; Wang, Meng. Light field saliency detection with deep convolutional networks. IEEE Trans. Image Process.; 2020; 29, pp. 4421-4434. [DOI: https://dx.doi.org/10.1109/TIP.2020.2970529]
40. Jing, D., Zhang, S., Cong, R., Lin, Y.: Occlusion-aware bi-directional guided network for light field salient object detection. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 1692–1701 (2021)
41. Zhang, Qiudan; Shiqi Wang, Xu; Wang, Zhenhao Sun; Kwong, Sam; Jiang, Jianmin. A multi-task collaborative network for light field salient object detection. IEEE Trans. Circuits Syst. Video Technol.; 2020; 31,
42. Shin, C., Jeon, H-G., Yoon, Y., So Kweon, T., Joo Kim, S.: Epinet: a fully-convolutional neural network using epipolar geometry for depth from light field images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4748–4757 (2018)
43. Wang, T., Zhang, L., Wang, S., Lu, H., Yang, G., Ruan, X., Borji, A.: Detect globally, refine locally: a novel approach to saliency detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3127–3135 (2018)
44. Wang, T., Borji, A., Zhang, L., Zhang, P., Lu, H.: A stagewise refinement model for detecting salient objects in images. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4019–4028 (2017)
45. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
46. Wu, R., Feng, M., Guan, W., Wang, D., Lu, H., Ding, E.: A mutual learning method for salient object detection with intertwined multi-supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8150–8159 (2019)
47. Jiang, Yao; Zhang, Wenbo; Keren, Fu; Zhao, Qijun. Meanet: multi-modal edge-aware network for light field salient object detection. Neurocomputing; 2022; 491, pp. 78-90. [DOI: https://dx.doi.org/10.1016/j.neucom.2022.03.056]
48. Máttyus, G., Luo, W., Urtasun, R.: Deeproadmapper: extracting road topology from aerial images. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3438–3446 (2017)
49. Piao, Y., Rong, Z., Xu, S., Zhang, M., Lu, H.: Dut-lfsaliency: versatile dataset and light field-to-rgb saliency detection. arXiv preprint arXiv:2012.15124 (2020)
50. Zhang, Qiudan; Shiqi Wang, Xu; Wang, Zhenhao Sun; Kwong, Sam; Jiang, Jianmin. Geometry auxiliary salient object detection for light fields via graph neural networks. IEEE Trans. Image Process; 2021; 30, pp. 7578-7592.
51. Gao, W., Fan, S., Li, G., Lin, W.: A thorough benchmark and a new model for light field saliency detection. IEEE Trans. Pattern Anal. Mach. Intell. (2023)
52. Borji, Ali; Cheng, Ming-Ming; Jiang, Huaizu; Li, Jia. Salient object detection: a benchmark. IEEE Trans. Image Process.; 2015; 24,
53. Fan, D-P., Cheng, M-M., Liu, Y., Li, T., Borji, A.: Structure-measure: a new way to evaluate foreground maps. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4548–4557 (2017)
54. Fan, D.-P., Gong, C., Cao, Y., Ren, B., Cheng, M-M., Borji, A.: Enhanced-alignment measure for binary foreground map evaluation. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 698–704. International Joint Conferences on Artificial Intelligence Organization (2018)
55. Zhao, J.-X., Liu, J.J., Fan, D.-P., Cao, Y., Yang, J., Cheng, M.-M.: Egnet: edge guidance network for salient object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8779–8788 (2019)
56. Wu, Z., Su, L., Huang, Q.: Cascaded partial decoder for fast and accurate salient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3907–3916 (2019)
57. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18, pp. 234–241. Springer (2015)