Abstract
The image rectangling task aims to remove the boundary irregularity of a stitched image without reducing its wide-field-of-view content. Existing image rectangling methods are either limited to specific application scenarios or still leave small incomplete regions along the rectangular boundary. To this end, we propose a stepwise deep rectangling model, based on the idea of stepwise regression, for the general image rectangling task. Considering the influence of factors such as luminosity differences across regions of the image, we introduce a shallow feature encoder to eliminate their influence on mesh prediction. At the same time, we embed the mask information into the encoded image to constrain the network to learn a rectangular image with a complete boundary. Subsequently, we perform cumulative mesh regression prediction based on the multi-level features extracted by the feature extractor. Experimental results show that the proposed method performs well in a variety of stitched-image rectangling scenarios, exhibiting state-of-the-art performance in both qualitative and quantitative comparisons.
Introduction
Image stitching is an important branch of computer vision that aims to stitch a set of images captured at different moments, from different perspectives or by different sensors, with certain overlapping areas, into a seamless, wide-field-of-view image [42]. It has important applications in areas such as motion detection and tracking, remote sensing image processing, surveillance video [14, 30] and geoscience [10]. With the development of deep learning in image stitching in recent years, the creation of panoramas and ultra-wide-view images has gained increasing interest [59, 61]. However, stitched images often have unsatisfactory irregular boundaries, which limits the application of stitching technology in practice. More than 99% of the images in the "Panorama" tab of Flickr (flickr.com) have rectangular boundaries [40], suggesting that most people prefer stitched images with regular boundaries. The purpose of image rectangling is to warp a stitched image with irregular boundaries into a rectangular image with regular boundaries, without reducing the wide-field-of-view content of the stitched image, thus making it better suited to applications in photography, surveillance video, virtual reality and other fields.
Fig. 1 [Images not available. See PDF.]
SDR workflow. The input image with irregular boundaries is passed through the feature encoder, the feature extractor and three-stage rectangle warping (the rectangle warping network) to obtain the output image with regular boundaries
The simplest solution for images with irregular boundaries is to obtain a rectangular image by direct cropping, but this comes at the expense of some image content. Another solution is image completion [3, 9, 11, 24, 29, 51, 53, 54], which fills in the missing areas with image content near the irregular boundary. This approach is suitable for filling images with similar textures or simple structures, but not for filling missing areas in images with complex structures. In addition, such image completion methods produce information that is not otherwise present in the image, which is extremely unreliable in application scenarios that require content fidelity, such as surveillance video.
Different from image completion, image rectangling rectangularizes the boundary of an image with irregular boundaries without adding extra image information. Existing research on image rectangling has gradually shifted from application-specific rectangling to rectangling for general application scenarios, but some problems remain. He et al. [20] constructed a content-aware mesh warping algorithm using seam carving techniques [16] to rectangularize images in multiple application scenes, but the method relies on rigid structures in the image and performs poorly in scenes containing nonlinear structures such as portraits and landscapes. Nie et al. [40] first introduced deep learning to the field of image rectangling, preserving nonlinear structures across a wider range of application scenarios, but the boundaries of their rectangular images still exhibit some content loss and incompleteness.
In order to obtain rectangular images with content fidelity and more complete boundaries, we propose a stepwise deep rectangling model (SDR), based on the idea of stepwise regression, for the general image rectangling task. Figure 1 shows the SDR workflow, which consists of a feature encoder, a feature extractor and a rectangle warping network. Specifically, SDR comprises four steps: multi-level feature extraction, primary rectangle warping, middle rectangle warping and final rectangle warping.
In the first step, the multi-level feature maps are obtained by the feature encoder and the feature extractor. Feature map1 and feature map2 denote the 8th-level and 6th-level feature maps extracted by the feature extractor, respectively. In the second step, feature map1 is passed into the primary rectangle warping network, and the predicted primary warped mesh is used to warp feature map1. In the third step, the middle rectangle warping network further approximates the exact mesh motion by predicting the mesh momentum of the warped feature map1 at the same level and summing it with the primary warped mesh. Shallow feature maps contain more fine-grained information, so, to make the network capture more detail, the middle warped mesh is first used to warp the shallower feature map2; in the fourth step, the final mesh momentum is predicted on the warped feature map2 and added to the middle warped mesh to obtain the final warped mesh, yielding a rectangular image with regular boundaries. The cascaded rectangle warping network is designed based on the idea of stepwise regression, which enables the model to learn mesh motion that fits the input image as closely as possible through cumulative mesh regression prediction. The input image is inversely warped from the final warped mesh to the standard rectangular mesh and multiplied by the warped mask to obtain the rectangular image.
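As a rough illustration of this stepwise idea, the cumulative prediction reduces to a few lines. The following is a minimal sketch with hypothetical names, where `predictors` stands in for the three momentum predictors and `feats` for the feature maps each stage sees:

```python
import numpy as np

def stepwise_mesh(predictors, feats, grid_u=6, grid_v=8):
    """Cumulative (stepwise) mesh regression: each stage predicts a
    residual mesh momentum that is added onto the running mesh."""
    ys, xs = np.meshgrid(np.linspace(0.0, 1.0, grid_u + 1),
                         np.linspace(0.0, 1.0, grid_v + 1), indexing="ij")
    mesh = np.stack([xs, ys], axis=-1)        # standard rectangular mesh
    for predict, feat in zip(predictors, feats):
        mesh = mesh + predict(feat, mesh)     # accumulate residual motion
    return mesh                               # final warped mesh
```

In SDR the three stages correspond to feature map1, the warped feature map1 and the warped feature map2, as described above.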
The experimental results show that the proposed method exhibits optimal rectangularization performance in both qualitative and quantitative comparisons. The specific contributions of this paper are as follows:
A stepwise deep rectangling model is proposed based on the idea of stepwise regression, and the resulting rectangular images have higher content fidelity and boundary integrity.
A feature encoder is introduced to improve the applicability of the model in multiple scenes, while embedding mask information into the encoded image enhances the model's global attention to boundary information.
Related work
In this section, we review the work involved in image stitching, image retargeting, image completion and image rectangling.
Image stitching
In recent years, image stitching technology has developed rapidly and can be broadly classified into traditional methods and deep learning-based methods. Traditional methods are mainly feature-based [1, 22]: some replace a global homography [15, 37] with multiple content-aware local homographies to achieve better local alignment [7, 8, 32], while seam-driven methods do not strictly align the entire overlapping region but only the region near the seam, in order to reduce artifacts from image fusion [35, 57, 61]. However, such methods rely heavily on the quality of feature-point detection, and stitching performance degrades in low-texture scenes. Deep learning-based methods [10, 42, 43] eliminate the dependence on feature points and show high robustness in low-texture, low-overlap and small-parallax scenes. To handle large parallax, Liao et al. [34] proposed a depth-map-based natural image stitching method that not only aligns overlapping regions accurately but also maintains good visual naturalness in non-overlapping regions. Although content-aware, natural-looking stitching results can be obtained with existing methods, their boundaries still suffer from irregularity.
Image retargeting
Image retargeting is the content-aware resizing of images. Avidan and Shamir [2] first proposed a content-aware image retargeting method based on an importance map. Importance-map-based retargeting techniques can be broadly classified into three categories: discrete retargeting algorithms represented by seam carving [2, 18, 44, 45], continuous retargeting algorithms represented by mesh deformation [17, 25, 28, 36, 47], and multi-operator retargeting algorithms [13, 63, 64]. The importance maps in most content-based retargeting methods are derived mainly from shallow information and are limited in fusing shallow detail with deep semantic information. Researchers therefore turned to deep neural networks to obtain multi-scale semantic information of images to compensate for this shortcoming. Deep learning-based image retargeting algorithms can be divided into deep neural network-based retargeting [6, 39, 48, 52], deep reinforcement learning-based multi-operator methods [27, 62] and aesthetics-aware image cropping [33, 38, 46]. Deep neural networks can not only extract rich semantic information but also better represent the semantic structure of images, which has allowed deep learning to gradually dominate image retargeting research. However, image retargeting requires the input image to be rectangular and does not address the rectangularization of images with irregular boundaries.
Image completion
Image completion is the inpainting of missing areas in an image based on the information available in the image, but it involves inherent uncertainty: a single image to be restored can yield multiple plausible results, so prior knowledge is often added to constrain the solution space. Traditional image completion techniques mainly include pixel diffusion-based methods [3, 5] and structural texture-based methods [11, 49, 54], but these are only suitable for completing small missing areas or broken images with similar texture structures, and are not applicable to images with many irregular boundaries and complex texture structures. With the rapid development of deep learning, image completion techniques have also advanced quickly. Deep learning-based image completion methods [4, 9, 23, 24, 55, 56] introduce convolutional neural networks, generative adversarial networks and related tools to improve the overall semantic coherence of images and the quality of the completed regions. However, image completion methods are not suited to synthesizing high-level semantic content and can generate information that is not originally present in the image, which is extremely unreliable in application scenarios that require content fidelity, such as surveillance video.
Fig. 2 [Images not available. See PDF.]
Overall network framework diagram. Our model consists of three parts: (1) feature encoder: the input image is transformed into the feature space to obtain more robust image features containing mask information. (2) Feature extractor: a convolutional neural network module used to extract feature maps at different levels. (3) Rectangle warping network: composed of three sub-networks, it gradually guides the model to learn the rectangularizing deformation of the image via cumulative mesh regression prediction. The input image is inversely warped from the final warped mesh to the standard rectangular mesh, and the rectangular image is obtained by multiplying by the warped mask
Image rectangling
Image rectangling is mainly used to rectangularize stitched images with irregular boundaries to provide better visual perception. Traditional image rectangling usually requires two stages: local warping and global warping. He et al. [20] proposed a content-aware warping algorithm for rectangling stitched panoramic images; the design of a line-preserving energy function gives the algorithm a good line-preserving effect on images with rich linear structure, but this also leads to poor performance in nonlinear or other scenes. Zhang et al. [60] built on He et al.'s work with a piecewise rectangularization idea, achieving a better warping effect by first partitioning the image, then rectangularizing the pieces and finally integrating them. However, the partitioning causes the result to lose global consistency, and the resulting image is not necessarily a complete rectangle, contrary to the original intention of rectangularization. Li et al. [31] improved the line-preserving energy function and proposed a content-aware image warping method that preserves geodesic lines, with a better line-preserving effect in the panoramic image rectangling task. However, geodesics are curves induced by the panoramic projection, so erroneous results can occur when a curve actually comes from a curved object; in addition, the method requires knowledge of the projection of the panoramic image to work effectively. Nie et al. [40] first introduced deep learning to the field of image rectangling via global warping, but the boundaries of their rectangular images still exhibit some content loss and incompleteness.
Methodology
Overview
Figure 2 illustrates the overall framework of SDR, which is based on the idea of stepwise regression. A comprehensive loss function for the rectangularization task is constructed by combining perceptual loss, appearance loss and mesh loss, constraining the content fidelity and boundary integrity of the rectangular images generated by SDR. A feature encoder is introduced to eliminate the influence of factors such as illumination differences between regions of the input image on mesh prediction and to enhance the applicability of the model in multiple scenes. Mask information is embedded into the encoded image, and global optimization is performed to enhance the network's global attention to boundary information. The encoded image is then passed into the feature extractor to extract low-level detail information and high-level semantic information. The rectangle warping network predicts the mesh motion of feature maps at different levels, and the final accurate mesh motion is obtained by accumulating the mesh momenta. At the same time, multi-level constraints guide the model to gradually learn the rectangularizing deformation of the input irregular image. Compared with general image rectangling methods, the proposed method exhibits state-of-the-art performance in both qualitative and quantitative comparisons, and the effectiveness of each module is verified by ablation experiments.
SDR structure
Feature encoder Considering the impact of factors such as luminosity differences between regions of the input image on the mesh prediction network, and inspired by Zhang et al. [58], a shallow feature encoder is introduced before the image is fed into the feature extractor to enhance the applicability of the overall model in multiple scenes. Specifically, the feature encoder consists of a feature transformation block and a mask embedding operation. The feature transformation block is a lightweight convolutional neural network module with three basic convolutional blocks, whose output channels are 32, 64 and 3, respectively. The stitched image is first passed into the feature transformation block for feature encoding, eliminating factors such as the luminosity differences of the input image and yielding a shallow feature map that is more robust to brightness change. The shallow feature map is then multiplied by the input mask to embed the irregular-boundary information of the input image, and the result is passed to the subsequent network, where global optimization enhances the network's global attention to boundary information.
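As a concrete illustration, the feature encoder might be sketched with tf.keras as follows; this is a minimal sketch under the channel configuration stated above, and the kernel sizes and activations are our assumptions, since the text does not specify them:

```python
import tensorflow as tf
from tensorflow.keras import layers

def feature_encoder(image, mask):
    """Feature transformation block (three conv blocks with 32, 64 and 3
    output channels) followed by the mask embedding operation."""
    x = image
    for ch in (32, 64, 3):                        # channels given in the text
        x = layers.Conv2D(ch, 3, padding="same",  # 3x3 kernels: assumption
                          activation="relu")(x)
    return x * mask                               # embed boundary information
```

Multiplying by the mask zeroes the encoded features outside the valid stitched region, so the irregular boundary travels with the features through the rest of the network.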
Feature extractor The feature extractor aims to extract low-level detail information and high-level semantic information from the encoded image. The module consists of eight convolutional layers and three max pooling layers. The output channels of the eight convolutional layers are 64, 64, 64, 64, 128, 128, 128 and 128, respectively, and the max pooling layers are located after the 2nd, 4th and 6th convolutional layers. The pooling removes redundant information from the feature maps and compresses the features, thereby simplifying network complexity and reducing the estimated mean shift caused by errors in the convolutional layer parameters. Each pooling layer downsamples its input by a factor of 2 and outputs a level of feature maps; the feature maps of different levels extracted by the feature extractor then serve as inputs to the subsequent rectangle warping network.
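In the same hedged style, the extractor can be sketched as a plain convolutional stack that keeps each level's output (again assuming 3x3 ReLU convolutions; `layers` is tf.keras.layers as above):

```python
def feature_extractor(encoded):
    """Eight conv layers (64 x4, 128 x4) with 2x max pooling after the
    2nd, 4th and 6th layers. All per-layer outputs are kept so the
    warping network can use both deep and shallow levels: feats[7] is
    the 8th-level map (feature map1), feats[5] the 6th-level map
    (feature map2)."""
    feats, x = [], encoded
    for i, ch in enumerate((64, 64, 64, 64, 128, 128, 128, 128), start=1):
        x = layers.Conv2D(ch, 3, padding="same", activation="relu")(x)
        feats.append(x)
        if i in (2, 4, 6):                        # pooling positions from text
            x = layers.MaxPool2D(2)(x)
    return feats
```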
Rectangle warping network As shown by DeTone et al. [12], directly predicting the positions of mesh vertices is difficult. Therefore, we indirectly obtain the vertex positions by predicting the motion vector of each mesh vertex in the horizontal and vertical directions. To gradually guide the network in learning the rectangularizing deformation of an image, a rectangle warping network is introduced to predict the motion vectors of the mesh vertices of the encoded image in a stepwise regression manner. Starting from the 8th-level feature map extracted by the feature extractor, the mesh momentum is predicted progressively, and the rectangle warping network is realized by three cascaded sub-networks with independent weights. In the first sub-network, a standard rectangular mesh is placed on feature map1, and the mesh momentum predictor predicts the motion vector of each mesh vertex in feature map1; the primary warped mesh is obtained by adding the predicted primary mesh motion to the standard rectangular mesh. In the second sub-network, the primary warped mesh is used to warp feature map1, the warped feature map1 is used as input, and the mesh momentum predictor predicts again; the middle warped mesh is obtained by adding the middle mesh motion from the current prediction to the output of the first sub-network. To make the network capture more details and thus predict the mesh momentum of the original image more accurately, we predict on feature maps of different levels: the middle warped mesh is used to warp the shallower feature map2, and the warped result is used as the input of the third sub-network, whose output is accumulated with the output of the second sub-network to obtain the final warped mesh. To reduce the risk of content distortion caused by excessive mesh deformation during prediction, we constrain the rectangularizing deformation of each warped mesh. Finally, the original input image and the input mask are warped by direct linear transformation (DLT) using the three-stage warped meshes to obtain the warped results of the three stages.
In each sub-network, a mesh momentum predictor predicts the motion vector of each mesh vertex in the horizontal and vertical directions for the feature map placed on the standard rectangular mesh, paving the way for the subsequent warped mesh that fits the feature map itself. It consists of eight convolutional layers, three max pooling layers and three fully connected layers. The output channels of the eight convolutional layers are 256, 256, 256, 256, 512, 512, 512 and 512, respectively. The output channels of the three fully connected layers are 2048, 1024 and $(U+1)\times(V+1)\times 2$ (V and U represent the number of grids into which the input image is divided in the horizontal and vertical directions, respectively), and a max pooling layer is inserted after the 2nd, 4th and 6th convolutional layers.
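A sketch of one mesh momentum predictor follows directly from these numbers; the reshape to one 2-D motion vector per mesh vertex is our reading of the final layer size:

```python
def momentum_predictor(feat, grid_v=8, grid_u=6):
    """Per-vertex motion head: eight conv layers (256 x4, 512 x4) with
    max pooling after the 2nd, 4th and 6th, then fully connected layers
    2048 -> 1024 -> (U+1)(V+1)*2, reshaped to vertex motion vectors."""
    x = feat
    for i, ch in enumerate((256, 256, 256, 256, 512, 512, 512, 512), start=1):
        x = layers.Conv2D(ch, 3, padding="same", activation="relu")(x)
        if i in (2, 4, 6):
            x = layers.MaxPool2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(2048, activation="relu")(x)
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dense((grid_u + 1) * (grid_v + 1) * 2)(x)  # motion per vertex
    return layers.Reshape((grid_u + 1, grid_v + 1, 2))(x)
```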
Fig. 3 [Images not available. See PDF.]
Feature extractor in Residual-SDR. The numbers under the yellow rectangular block represent the convolutional kernel size and stride of the convolution layer, respectively
Residual-SDR structure
Based on the effectiveness of the stepwise regression idea adopted in SDR, we further explore the impact of architectural design on rectangularization performance. To this end, we introduce residual structure into SDR, following the design of ResNet [19], and construct a residual version of SDR (Residual-SDR). Specifically, we replace the feature extractor in SDR with the residual network in Fig. 3 and redesign the rectangle warping network in SDR using the downsampling residual block (DRB) shown in Fig. 3. The feature extractor in Residual-SDR consists of one convolutional layer, one max pooling layer, one non-downsampling residual block (NRB) and two DRBs. The output channels of the first convolutional layer and the NRB are 64, and the output channels of the two DRBs are 128 and 256, respectively. The kernel size and stride of each convolutional layer are shown in Fig. 3.
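The two residual block variants might look as follows; this is an illustrative sketch in the spirit of ResNet [19], since the exact kernel sizes and strides are given only in Fig. 3, and the 1x1 projection shortcut in the DRB is standard ResNet practice rather than something the text confirms:

```python
def residual_block(x, ch, downsample=False):
    """NRB (stride 1, identity shortcut) or DRB (stride 2, projection
    shortcut), following common ResNet design."""
    stride = 2 if downsample else 1
    shortcut = x
    if downsample or x.shape[-1] != ch:
        shortcut = layers.Conv2D(ch, 1, strides=stride)(x)  # 1x1 projection
    y = layers.Conv2D(ch, 3, strides=stride, padding="same",
                      activation="relu")(x)
    y = layers.Conv2D(ch, 3, padding="same")(y)
    return layers.ReLU()(layers.Add()([y, shortcut]))
```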
Loss function
Nie et al. [40] constructed a mask loss to constrain the boundary integrity of rectangularized images, but when the predicted mesh does not completely cover the input image, the mask loss fights against the appearance loss, which is detrimental to training. For this reason, we embed the mask information into the encoded image and pass it to the subsequent prediction model; the overall optimization is performed directly through the appearance loss, which indirectly constrains the boundary integrity of the rectangular image. The experimental results show that the mask embedding operation is superior to the direct use of a mask loss. Therefore, during training we use only a combined objective function based on perceptual loss, appearance loss and mesh loss to guide the network toward rectangularized images with content fidelity and complete boundaries. The specific combination is as follows:
$$\mathcal{L} = \mathcal{L}_{p} + \mathcal{L}_{a} + \lambda \mathcal{L}_{m} \quad (1)$$

Among them, $\mathcal{L}_{p}$ is the perceptual loss, $\mathcal{L}_{a}$ is the appearance loss, and $\mathcal{L}_{m}$ is the mesh loss. $\lambda$ is a weight coefficient used to balance the importance of the three loss functions.
Perceptual loss To make the rectangularized images more natural, a perceptual loss is introduced so that, through feature alignment, the warped rectangular images become more similar to the target rectangular images in high-level semantic information such as content and global structure. Similar to Nie et al.'s method [40], the output of the "conv4_2" layer of Visual Geometry Group-19 (VGG19) [50] is chosen in this paper to minimize the distance between the warped rectangular image and the target rectangular image in high-level semantic perception, enhancing the detail richness and semantic coherence of the output image. The calculation [26] is shown in Formulas (2)–(3).
$$R_i = \mathcal{W}(I, m_i) \times \mathcal{W}(M, m_i) \quad (2)$$

$$\mathcal{L}_{p} = \left\| \phi(R_3) - \phi(T) \right\|_2^2 \quad (3)$$

Among them, $\phi(\cdot)$ extracts the features of the "conv4_2" layer in VGG19, and $T$ represents the target rectangular image, namely the rectangular label. $R_i$ and $m_i$ represent the rectangular result and predicted mesh obtained at a given level $i$ of the rectangle warping network, where $i \in \{1, 2, 3\}$. $\|\cdot\|_2$ represents the $\ell_2$ norm, $\mathcal{W}(\cdot)$ represents the warp operation, and $I$ and $M$ represent the input image and mask, respectively.
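A sketch of this term with the pretrained VGG19 in tf.keras (the Keras layer name "block4_conv2" corresponds to "conv4_2"; applying the loss only to the final result $R_3$ follows the reconstruction above and is an assumption):

```python
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
phi = tf.keras.Model(vgg.input,
                     vgg.get_layer("block4_conv2").output)  # conv4_2 features

def perceptual_loss(r3, target):
    """Squared L2 distance between conv4_2 features of the final
    rectangling result and the rectangular label (images in [0, 255])."""
    pre = tf.keras.applications.vgg19.preprocess_input
    return tf.reduce_mean(tf.square(phi(pre(r3)) - phi(pre(target))))
```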
Appearance loss The appearance loss minimizes the sum of the absolute differences of all elements between the true value and the predicted value. This paper uses the appearance loss to constrain the output to be as close to the rectangular label as possible in appearance. The calculation is as follows:
$$\mathcal{L}_{a} = \sum_{i=1}^{3} \left\| R_i - T \right\|_1 \quad (4)$$

Among them, $R_1$, $R_2$ and $R_3$ represent the rectangular results obtained by the three rectangle warping sub-networks, and $\|\cdot\|_1$ represents the $\ell_1$ norm.
Mesh loss We add mesh constraints during training to avoid excessive mesh deformation as much as possible, including an intra-grid loss and an inter-grid loss. For the mesh loss, we follow the mesh term developed in Nie et al.'s method [40]; to facilitate understanding, the loss function is described in detail here. The intra-grid loss constrains the deformation amplitude of the mesh and prevents content distortion in rectangularized images. The projection of a grid edge on the horizontal or vertical component is computed by the inner product formula, constraining the length of each grid edge to be greater than a threshold. The specific formula for the horizontal grid edge constraint is as follows:
$$\mathcal{L}_{h} = \frac{1}{(U+1)V} \sum_{\vec{e} \in \mathcal{E}_h} \max\left(0,\ \alpha \frac{W}{V} - \left\langle \vec{e}, \vec{x} \right\rangle\right) \quad (5)$$

Here, $W$ represents the width of the input image, $V$ represents the number of grids in the horizontal direction of the input image, and $\alpha$ is a coefficient constraining the minimum grid width. $\vec{e}$ represents a horizontal edge vector of the grid, $\mathcal{E}_h$ is the set of horizontal grid edges, and $\vec{x}$ represents the horizontal unit vector pointing right. The grid edge constraint in the vertical direction is similar to Formula (5), as follows:

$$\mathcal{L}_{v} = \frac{1}{U(V+1)} \sum_{\vec{e} \in \mathcal{E}_v} \max\left(0,\ \alpha \frac{H}{U} - \left\langle \vec{e}, \vec{y} \right\rangle\right) \quad (6)$$

Here, $H$ represents the height of the input image, $U$ represents the number of grids in the vertical direction of the input image, $\vec{e}$ represents a vertical edge vector of the grid, $\mathcal{E}_v$ is the set of vertical grid edges, and $\vec{y}$ represents the vertical unit vector pointing down. The intra-grid loss combines the grid edge constraints in the horizontal and vertical directions, as shown in Equation (7):

$$\mathcal{L}_{intra} = \mathcal{L}_{h} + \mathcal{L}_{v} \quad (7)$$
The inter-grid loss constrains the relative deformation of adjacent grids to ensure the content coherence of the rectangular image. By computing the cosine of the angle between two consecutive edges $\vec{e}_1$ and $\vec{e}_2$ of adjacent grids, the two grid edges are encouraged to be as collinear as possible. The calculation is shown in Formula (8):

$$\mathcal{L}_{inter} = \frac{1}{N} \sum \left(1 - \frac{\left\langle \vec{e}_1, \vec{e}_2 \right\rangle}{\left\| \vec{e}_1 \right\| \left\| \vec{e}_2 \right\|}\right) \quad (8)$$
where the sum runs over all such edge pairs and $N$ represents the number of tuples of two consecutive grid edges in the horizontal and vertical directions. Therefore, the mesh loss function is:

$$\mathcal{L}_{m} = \mathcal{L}_{intra} + \mathcal{L}_{inter} \quad (9)$$
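Below is a minimal TensorFlow sketch of this mesh term as reconstructed above; the mean reductions and the exact pairing of consecutive edges are our assumptions, and `mesh` is assumed to be a tensor of vertex coordinates of shape (U+1, V+1, 2) in pixels:

```python
import tensorflow as tf

def mesh_loss(mesh, W, H, alpha):
    """Intra-grid: each horizontal/vertical edge projection should exceed
    alpha*W/V (resp. alpha*H/U). Inter-grid: consecutive edges should
    remain nearly collinear (cosine close to 1)."""
    U, V = mesh.shape[0] - 1, mesh.shape[1] - 1
    e_h = mesh[:, 1:] - mesh[:, :-1]              # horizontal edge vectors
    e_v = mesh[1:, :] - mesh[:-1, :]              # vertical edge vectors
    intra = (tf.reduce_mean(tf.nn.relu(alpha * W / V - e_h[..., 0])) +
             tf.reduce_mean(tf.nn.relu(alpha * H / U - e_v[..., 1])))

    def collinearity(e):                          # 1 - cos(angle) of pairs
        a, b = e[:, :-1], e[:, 1:]                # consecutive edge tuples
        cos = (tf.reduce_sum(a * b, -1) /
               (tf.norm(a, axis=-1) * tf.norm(b, axis=-1) + 1e-8))
        return tf.reduce_mean(1.0 - cos)

    inter = collinearity(e_h) + collinearity(tf.transpose(e_v, [1, 0, 2]))
    return intra + inter
```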
Fig. 4 [Images not available. See PDF.]
Rectangularization effect images. Input, Mesh, Result and Label represent image with irregular boundaries, predicted final warped mesh, output rectangular result and rectangular label, respectively
Experiments
Dataset and evaluation metrics
In this study, the proposed method is trained and tested on the DIR-D dataset [40]. This dataset contains images of multiple scenes, such as low-texture and low-light scenes, with different texture features and both linear and nonlinear structures; the training and test sets contain 5839 and 519 images, respectively, with corresponding masks and rectangular labels. At the same time, to verify that the proposed model has excellent generalization performance, we select the UDIS-D dataset [43] for cross-dataset validation; the UDIS-D dataset comes from real-world stitching scenes.
As evaluation metrics for rectangular images, we choose the structural similarity index (SSIM), peak signal-to-noise ratio (PSNR) and Fréchet inception distance (FID) [21]. SSIM takes brightness, contrast and structure into account to better simulate the human eye's perception of image quality; it measures the similarity of two images, with larger values representing more structural information preserved in the rectangular image. PSNR measures the ratio of valid information to noise in the resulting image; a larger value means more detail of the original image is preserved in the rectangular result. FID computes the distance between the feature vectors of the real and generated images; a lower score means the two sets of images are more similar.
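SSIM and PSNR can be computed directly with TensorFlow's built-in image ops; a usage sketch, assuming `result` and `label` are float image tensors scaled to [0, 1]:

```python
def image_metrics(result, label):
    """Per-image SSIM and PSNR between a rectangling result and its
    rectangular label; for both, larger is better."""
    ssim = tf.image.ssim(result, label, max_val=1.0)
    psnr = tf.image.psnr(result, label, max_val=1.0)
    return ssim, psnr
```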
Implementation details
The batch size during training is 4, and the number of iterations is 100k. To make the model converge more stably, it is trained with an exponentially decaying learning rate, using a decay coefficient of 0.96 and a decay step of 50,000/4 iterations. The Adam optimizer is used to optimize the objective function throughout training. The numbers of grids V and U in the horizontal and vertical directions of the input image are 8 and 6, respectively. The model is implemented in TensorFlow, and all experiments are performed on a single NVIDIA RTX 3080 GPU with 16 GB of video memory.
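In tf.keras this schedule corresponds to the following sketch; the initial learning rate shown is a placeholder assumption, not the value used in the paper:

```python
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,   # placeholder (assumption)
    decay_steps=50_000 // 4,      # decay step from the text
    decay_rate=0.96)              # decay coefficient from the text
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```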
Rectangularization effect evaluation
The test images are divided into eight scenes, containing linear structure (Linear), nonlinear structure (Nonlinear), long shot (LS), close shot (CS), low-light scene (LL), high-light scene (HL), low-texture scene (LT) and scene with moving foreground (MF). Some of these images are rectangularized as shown in Fig. 4. As can be seen, the method in this paper can accurately predict the warped mesh of the original image and generate a rectangular image with complete boundary in various scenes.
To ensure fairness in the comparison, the proposed method is compared with two general image rectangling methods [20, 40] in terms of both qualitative and quantitative comparisons, and the parameter settings of the compared methods are kept consistent with the default parameters provided by the original authors.
Quantitative comparison
To measure the performance differences between image rectangling methods, we perform a quantitative comparison in terms of the number of model parameters, rectangularization rate and image evaluation metrics. Table 1 shows the results of our method and other image rectangling methods on the DIR-D dataset. Reference means using the boundary-irregular input image as the rectangular result; it serves as a reference in the metric comparison and is ignored in the efficiency comparison. Given that He et al.'s method is a traditional algorithm, we only report the number of model parameters for the deep learning-based method of Nie et al. and the method proposed in this paper.
Table 1. Quantitative comparison on DIR-D
| Method | #Params | Time | SSIM | PSNR | FID |
|---|---|---|---|---|---|
| Reference | – | – | 0.3245 | 11.3048 | 44.47 |
| He et al.'s [20] | – | 0.4936 s | 0.3775 | 14.7012 | 38.19 |
| Nie et al.'s [40] | 50.91 M | 0.1244 s | 0.7141 | 21.2764 | 21.77 |
| Ours | 76.06 M | 0.1587 s | 0.7232 | 21.5400 | 21.50 |
| Ours* | 78.80 M | 0.1116 s | 0.7247 | 21.5672 | 21.53 |
Time represents the average time of image rectangling on the DIR-D dataset. To eliminate the error caused by resource loading when a model first starts, each algorithm is run for 11 consecutive rounds and the mean of the last 10 rounds is reported. Ours* represents Residual-SDR
As can be seen from Table 1, our method significantly outperforms the other two methods on every image evaluation metric. Compared with the methods of He et al. and Nie et al., our method improves SSIM by 91.58% and 1.27%, improves PSNR by 46.52% and 1.24%, and reduces FID by 43.70% and 2.62%, respectively. The reason is that He et al.'s method can only preserve linear structure, which results in undesired distortion and content loss when dealing with nonlinearly structured objects such as portraits and landscapes, while Nie et al.'s method lacks consideration of multi-scene adaptivity and multi-level semantic information, limiting its scene generalization and preservation of edge information. In contrast, our method introduces a feature encoder from the perspective of general rectangularization to fully account for scene adaptivity and constructs a rectangle warping network that exploits multi-level semantic information; accurate mesh motion is then obtained by cumulative mesh regression prediction, achieving a satisfactory rectangularization effect.
Fig. 5 [Images not available. See PDF.]
Quantitative comparison in multiple scenarios. The red line indicates the proposed method and the legend shows the metric averages on the DIR-D test set for Nie et al.’s method and the proposed method
Table 1 also shows that our model, built on the idea of stepwise regression, exhibits the best rectangling performance, indicating that the stepwise regression design of SDR is effective; however, the stepwise regression slightly reduces operating efficiency. For this reason, we optimize the SDR network and construct a residual version of SDR. Residual-SDR successfully improves the rectangularization rate while further improving model performance.
For further comparison with Nie et al.'s method, comparative experiments are conducted on each scenario, with the results shown in Fig. 5. Our method outperforms Nie et al.'s method in every scene, which validates that SDR has strong scene generalization and robustness. It is worth mentioning that our method significantly outperforms Nie et al.'s method in high-light scenes. This is because high-light scenes exhibit luminosity differences across image regions; to overcome the effect of these differences on mesh prediction, we introduced the feature encoder, which allows SDR to show a clear advantage in such scenes.
Qualitative comparison
Figure 6 shows the results of the qualitative comparison. Our method clearly outperforms the comparison methods, thanks on one hand to our pre-processing approach of embedding mask information into the encoded image and to the appearance, perceptual and mesh losses added during training, which give the model good shape and content retention, and on the other hand to the rectangle warping network, which combines multi-level semantic information to gradually guide the model in optimizing the boundaries.
Figure 6 indicates that the rectangular results of our method have good boundary integrity. To evaluate our method more fully, we also perform content fidelity validation. Specifically, we use the method in DAMG [41] to fuse the rectangular images obtained by different rectangling methods with the rectangular label images, exposing the rectangularization performance of the different methods: when the image information of the two is misaligned, blue or orange artifacts appear. The comparison results are shown in Fig. 7, from which we find that our method also performs well in content fidelity.
Fig. 6 [Images not available. See PDF.]
Comparison of rectangular images. Yellow arrows represent warped failure regions, blue boxes represent content loss contrast regions, and red circles represent boundary incomplete regions
Fig. 7 [Images not available. See PDF.]
Rectangular result channel fusion comparison diagram. The value of the blue channel of the rectangular result, the value of the red channel of the real rectangular label and the average value of the green channel of the two are taken as the values of the blue, red and green channels of the fusion image. Among them, the red box and the yellow box represent the content deformation degree comparison area, and the blue circle represents the boundary incomplete area
Cross-dataset evaluation
To further verify that SDR generalizes well, we evaluate the rectangularization effect across datasets using the UDIS-D dataset from real-world image stitching scenarios. We first use the LB-UDHN [43] algorithm to stitch the UDIS-D dataset and then apply the different rectangling methods to the stitching results. The comparison results are shown in Fig. 8. The rectangular images obtained by He et al.'s method still exhibit some warping failures, content loss and incomplete boundaries. Nie et al.'s method performs better but still shows some unavoidable boundary incompleteness or content loss. Overall, our method has the best rectangularization performance, consistent with the comparison results on the DIR-D dataset in Sect. 4.3.2.
Fig. 8 [Images not available. See PDF.]
Comparison of rectangular results of real-world stitched images. Yellow arrows represent warped failure regions, blue boxes represent content loss contrast regions, and red circles represent boundary incomplete regions
Ablation study
We conduct ablation experiments on the DIR-D test set and use SSIM, PSNR metrics to verify the validity of each module of the model.
Feature transformation block Considering the effect of factors such as luminosity differences across regions of the input image, we perform a feature transformation on the input image before it enters the feature extractor and compare this with directly combining the input image and mask without the feature transformation; the comparison results are shown in Table 2. The feature transformation block helps enhance the scene generalization of the model. Because the feature transformation block has representation learning capability and can learn shallow feature representations of an image, adding it overcomes the effect of luminosity differences on mesh prediction better than raw pixel intensities, significantly improving the robustness and generalization of the model.
Mask embedding operation The mask loss is another way of using mask information: it constrains the warped mask to be close to an all-one matrix so that the boundary becomes as rectangular as possible. The traditional boundary constraint is realized by adding such a mask loss, but this constraint works against the appearance loss when the predicted mesh area is smaller than the input image area. We therefore embed the mask information of the original image into the encoded feature map and achieve global optimization of the model for content and boundary information through the appearance loss.
As can be seen from Table 3, the mask information positively guides the model's prediction, and embedding the mask information without the mask loss significantly improves model performance. The reason adding the mask loss weakens the rectangularization effect is that the mask loss constrains the warped mask: if the predicted mesh does not contain all of the image information, then even a warped mask that is an all-one matrix does not indicate a satisfactory rectangularization. Meanwhile, the mask loss in this case fights against the appearance loss and disturbs the direction of model convergence, producing repeatedly fluctuating loss values that hinder finding the optimal solution. It is also worth noting that either the concatenate or the multiply operation significantly improves mesh prediction performance as long as the mask information is embedded in the encoded image; the multiply operation is finally chosen to keep the mask embedding operation consistent before and after the model.
Fig. 9 [Images not available. See PDF.]
Rectangular results of different warping modes
Perceptual loss In this paper, perceptual loss is introduced to make the rectangular results perceptually more natural. Based on the ability of convolutional layers to extract high-level features, we select the features output by the "conv4_2" layer of VGG19 to optimize the model parameters and compare this with a multi-layer perceptual loss that selects both "conv3_2" and "conv4_2"; the results are shown in Table 4. Compared with using both "conv4_2" and "conv3_2", the model using only the higher-level "conv4_2" perceptual loss has better rectangularization performance, because the deeper, small-resolution feature maps relax the network's constraints on the magnitude of rectangular deformation and help the network explore better mesh warping positions. Moreover, higher-level feature maps capture high-level semantic information of the images, which is more suitable for abstract alignment of rectangularized images.
Table 2. Verification of feature transformation block
| Feature transformation block | SSIM | PSNR |
|---|---|---|
| W/o | 0.7131 | 21.3638 |
| W/ | 0.7232 | 21.5400 |
Table 3. Verification of mask operation and mask loss
| Mask operation mode | Mask loss | SSIM | PSNR |
|---|---|---|---|
| W/o | w/o | 0.6554 | 20.3543 |
| Concatenate | w/ | 0.7129 | 21.4142 |
| Concatenate | w/o | 0.7247 | 21.5384 |
| Multiply | w/o | 0.7232 | 21.5400 |
Table 4. Comparison of different perceptual losses
| Perceptual loss ("conv4_2") | Multi-layer perceptual loss ("conv3_2" + "conv4_2") | SSIM | PSNR |
|---|---|---|---|
| W/o | w/ | 0.7170 | 21.4493 |
| W/ | w/o | 0.7232 | 21.5400 |
Rectangle warping network Unlike Nie et al.'s method, we add consideration of multi-level feature warping and design a multi-level rectangle warping network; the model is gradually guided through three levels of warping to learn to produce rectangular images with content fidelity and complete boundaries. As shown in Table 5, compared with a single-level warping network using only same-level features, the rectangle warping network using multi-level features improves the SSIM and PSNR metrics by 2% and 0.8%, respectively. Figure 9 also shows that the rectangular results obtained with the rectangle warping network have more complete boundaries.
Table 5. Verification of rectangle warping network
| Warping mode | SSIM | PSNR |
|---|---|---|
| Same-level warping | 0.7090 | 21.3566 |
| Multi-level warping | 0.7232 | 21.5400 |
Conclusion
In this paper, aiming at the problems of poor scene generalization and incomplete boundary content in current rectangularization research, a stepwise deep rectangling model is constructed. A feature encoder is introduced to improve the scene generalization of the model, and a rectangle warping network is constructed to perform cumulative mesh regression prediction on multi-level feature maps. The resulting rectangular images have content fidelity and higher boundary integrity. Compared with a traditional method and a deep learning-based method, the proposed solution shows the best rectangularization performance in a series of comparative experiments. Finally, the effectiveness of each module of the model is verified by ablation experiments.
However, our method also has limitations. Because the rectangle warping network predicts the final accurate mesh motion in a stepwise regression manner, our method exhibits the best rectangularization performance in both qualitative and quantitative comparisons, but this design also inevitably increases the rectangularization time. Future research should therefore explore how to optimize the network to improve rectangularization efficiency while maintaining high performance, for example by designing more effective model architectures or introducing new techniques to accelerate the rectangularization process.
Acknowledgements
This work was supported by the Natural Science Foundation Key Project of Gansu Province (No. 23JRRA860), the Key Talent Project of Gansu Province, the Inner Mongolia Key R&D and Achievement Transformation Project (Nos. 2023YFSH0043, 2023YFDZ0043) and the Key Research and Development Project of Lanzhou Jiaotong University (No. ZDYF2304).
Data availability statement
The DIR-D and UDIS-D datasets analyzed in this study are available at https://pan.baidu.com/s/1aNpHwT8JIAfX_0GtsxsWyQ and https://pan.baidu.com/s/13KZ29e487datgtMgmb9laQ, respectively, and the extraction code is 1234.
Declarations
Conflict of interest statement
All authors declare that there is no conflict of interest that could compromise the fairness of this academic research.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Anzid, H; Le Goïc, G; Bekkari, A; Mansouri, A; Mammass, D. A new SURF-based algorithm for robust registration of multimodal images data. Vis. Comput.; 2023; 39,
2. Avidan, S., Shamir, A.: Seam carving for content-aware image resizing. In: ACM SIGGRAPH 2007 Papers. 10–es (2007)
3. Ballester, C; Bertalmio, M; Caselles, V; Sapiro, G; Verdera, J. Filling-in by joint interpolation of vector fields and gray levels. IEEE Trans. Image Process.; 2001; 10,
4. Cai, N; Zhenghang, S; Lin, Z; Wang, H; Yang, Z; Ling, BW-K. Blind inpainting using the fully convolutional neural network. Vis. Comput.; 2017; 33,
5. Chan, TF; Jianhong, S. Nontexture inpainting by curvature-driven diffusions. J. Vis. Commun. Image Represent.; 2001; 12,
6. Chang, C.-H., Chuang, Y.-Y.: A line-structure-preserving approach to image resizing. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1075–1082. IEEE (2012)
7. Chang, C.H., Sato, Y., Chuang, Y.Y.: Shape-preserving half-projective warps for image stitching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3254–3261 (2014)
8. Chen, Y.-S., Chuang, Y.-Y.: Natural image stitching with the global similarity prior. In: Part, V. (ed.) Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, pp. 186–201. Springer (2016)
9. Chen, Y; Liu, L; Tao, J; Xia, R; Zhang, Q; Yang, K; Xiong, J; Chen, X. The improved image inpainting algorithm via encoder and similarity constraint. Vis. Comput.; 2021; 37,
10. Chen, J; Zhenpeng, F; Huang, J; Xinrong, H; Peng, T. Boosting vision transformer for low-resolution borehole image stitching through algebraic multigrid. Vis. Comput.; 2022; 38,
11. Criminisi, A; Pérez, P; Toyama, K. Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process.; 2004; 13,
12. DeTone, D., Malisiewicz, T., Rabinovich, A.: Deep image homography estimation. arXiv preprint arXiv:1606.03798 (2016)
13. Dong, W-M; Bao, G-B; Zhang, X-P; Paul, J-C. Fast multi-operator image resizing and evaluation. J. Comput. Sci. Technol.; 2012; 27,
14. Gaddam, VR; Riegler, M; Eg, R; Griwodz, C; Halvorsen, P. Tiling in interactive panoramic video: approaches and evaluation. IEEE Trans. Multimedia; 2016; 18,
15. Gao, J; Jun, W; Zhao, X; Gang, X. Integrating TPS, cylindrical projection, and plumb-line constraint for natural stitching of multiple images. Vis. Comput.; 2023; 2023, pp. 1-30.
16. Garg, A; Singh, AK. Analysis of seam carving technique: limitations, improvements and possible solutions. Vis. Comput.; 2023; 39,
17. Guo, Y; Liu, F; Shi, J; Zhou, Z-H; Gleicher, M. Image retargeting using mesh parametrization. IEEE Trans. Multimedia; 2009; 11,
18. Han, D; Sonka, M; Bayouth, J; Xiaodong, W. Optimal multiple-seams search for image resizing with smoothness and shape prior. Vis. Comput.; 2010; 26,
19. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
20. He, K; Chang, H; Sun, J. Rectangling panoramic images via warping. ACM Trans. Graph. (TOG); 2013; 32,
21. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local Nash equilibrium. Adv. Neural Inf. Process. Syst. 30 (2017)
22. Hossein-Nejad, Z; Nasri, M. Clustered redundant keypoint elimination method for image mosaicing using a new Gaussian-weighted blending algorithm. Vis. Comput.; 2022; 38,
23. Iizuka, S; Simo-Serra, E; Ishikawa, H. Globally and locally consistent image completion. ACM Transactions on Graphics (ToG); 2017; 36,
24. JieJie, X; Zhu, Y; Wang, W; Liu, G. A real-time semi-dense depth-guided depth completion network. Vis. Comput.; 2023; 2023, pp. 1-11.
25. Jin, Y; Liu, L; Qingbiao, W. Nonhomogeneous scaling optimization for realtime image resizing. Vis. Comput.; 2010; 26, pp. 769-778. [DOI: https://dx.doi.org/10.1007/s00371-010-0472-8]
26. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part II 14, pp. 694–711. Springer (2016)
27. Kajiura, N., Kosugi, S., Wang, X., Yamasaki, T.: Self-play reinforcement learning for fast image retargeting. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 1755–1763 (2020)
28. Karni, Z., Freedman, D., Gotsman, C.: Energy-based image deformation. In: Computer Graphics Forum, vol. 28, pp. 1257–1268. Wiley Online Library (2009)
29. Kopf, J; Kienzle, W; Drucker, S; Kang, SB. Quality prediction for image completion. ACM Trans. Graph. (ToG); 2012; 31,
30. Krishnakumar, K; Indira Gandhi, S. Video stitching based on multi-view spatiotemporal feature points and grid-based matching. Vis. Comput.; 2020; 36,
31. Li, D., He, K., Sun, J., Zhou, K.: A geodesic-preserving method for image warping. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 213–221 (2015)
32. Li, J; Wang, Z; Lai, S; Zhai, Y; Zhang, M. Parallax-tolerant image stitching based on robust elastic warping. IEEE Trans. Multimedia; 2017; 20,
33. Li, D; Huikai, W; Zhang, J; Huang, K. Fast a3rl: aesthetics-aware adversarial reinforcement learning for image cropping. IEEE Trans. Image Process.; 2019; 28,
34. Liao, T., Li, N.: Natural image stitching using depth maps. arXiv preprint arXiv:2202.06276 (2022)
35. Lin, K., Jiang, N., Cheong, L.F., Do, M., Lu, J.: Seagull: Seam-guided local alignment for parallax-tolerant image stitching. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part III 14, pp. 370–385. Springer (2016)
36. Lin, SS; Yeh, IC; Lin, CH; Lee, TY. Patch-based image warping for content-aware retargeting. IEEE Trans Multimedia; 2012; 15,
37. Liu, Y; Deng, Yu; Chen, X; Li, Z; Fan, J. TOP-SIFT: the selected SIFT descriptor based on dictionary learning. Vis. Comput.; 2019; 35,
38. Lu, P., Liu, J., Peng, X. and Wang, X.: Weakly supervised real-time image cropping based on aesthetic distributions. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 120–128 (2020)
39. Mastan, I.D., Raman, S.: Dcil: Deep contextual internal learning for image restoration and image retargeting. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2366–2375 (2020)
40. Nie, L., Lin, C., Liao, K., Liu, S., Zhao, Y.: Deep rectangling for image stitching: a learning baseline. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5740–5748 (2022)
41. Nie, L., Lin, C., Liao, K., Liu, S., Zhao, Y.: Depth-aware multi-grid deep homography estimation with contextual correlation. arXiv preprint arXiv:2107.02524 (2021)
42. Nie, L; Lin, C; Liao, K; Liu, M; Zhao, Y. A view-free image stitching network based on global homography. J. Vis. Commun. Image Represent.; 2020; 73,
43. Nie, L; Lin, C; Liao, K; Liu, S; Zhao, Y. Unsupervised deep image stitching: reconstructing stitched features to images. IEEE Trans. Image Process.; 2021; 30,
44. Noh, H., Han, B.: Seam carving with forward gradient difference maps. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 709–712 (2012)
45. Oliveira, SA; Neto, ARR; Bezerra, FN. A novel Genetic Algorithms and SURF-Based approach for image retargeting. Expert Syst. Appl.; 2016; 44,
46. Peng, L; Zhang, H; Peng, X; Jin, X. Learning the relation between interested objects and aesthetic region for image cropping. IEEE Trans. Multimedia; 2020; 23,
47. Shi, M., Yang, L., Peng, G., Xu, D.: A content-aware image resizing method with prominent object size adjusted. In: Proceedings of the 17th ACM Symposium on Virtual Reality Software and Technology, pp. 175–176 (2010)
48. Shocher, A., Bagon, S., Isola, P., Irani, M.: Ingan: capturing and retargeting the "DNA" of a natural image. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4492–4501 (2019)
49. Simakov, D., Caspi, Y., Shechtman, E., Irani, M.: Summarizing visual data using bidirectional similarity. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2008)
50. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
51. Suvorov, R., Logacheva, E., Mashikhin, A., Remizova, A., Ashukha, A., Silvestrov, A., Kong, N., Goka, H., Park, K., Lempitsky, V.: Resolution-robust large mask inpainting with Fourier convolutions. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2149–2159 (2022)
52. Tan, W; Yan, B; Lin, C; Niu, X. Cycle-IR: deep cyclic image retargeting. IEEE Trans. Multimedia; 2019; 22,
53. Teterwak, P., Sarna, A., Krishnan, D., Maschinot, A., Belanger, D., Liu, C., Freeman, W.T.: Boundless: generative adversarial networks for image extension. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10521–10530 (2019)
54. Wexler, Y; Shechtman, E; Irani, M. Space-time completion of video. IEEE Trans. Pattern Anal. Mach. Intell.; 2007; 29,
55. Yan, Z., Li, X., Li, M., Zuo, W., Shan, S.: Shift-net: image inpainting via deep feature rearrangement. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 1–17 (2018)
56. Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S.: Free-form image inpainting with gated convolution. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4471–4480 (2019)
57. Zhang, F., Liu, F.: Parallax-tolerant image stitching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3262–3269 (2014)
58. Zhang, J., Wang, C., Liu, S., Jia, L., Ye, N., Wang, J., Zhou, J., Sun, J.: Content-aware unsupervised deep homography estimation. In: Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pp. 653–669. Springer (2020)
59. Zhang, J; Xiu, Y. Image stitching based on human visual system and SIFT algorithm. Vis. Comput.; 2023; 2023, pp. 1-13.
60. Zhang, Y; Lai, Y-K; Zhang, F-L. Content-preserving image stitching with piecewise rectangular boundary constraints. IEEE Trans. Visual Comput. Graph.; 2020; 27,
61. Zhang, J; Gao, Y; Yi, X; Huang, Y; Yanming, Yu; Shu, X. A simple yet effective image stitching with computational suture zone. Vis. Comput.; 2023; 39,
62. Zhou, Y; Chen, Z; Li, W. Weakly supervised reinforced multi-operator image retargeting. IEEE Trans. Circuits Syst. Video Technol.; 2020; 31,
63. Zhu, L., Chen, Z., Chen, X., Liao, N.: Saliency & structure preserving multi-operator image retargeting. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1706–1710. IEEE (2016)
64. Zhu, L., Chen, Z.: Fast genetic multi-operator image retargeting. In: 2016 Visual Communications and Image Processing (VCIP), pp. 1–4. IEEE (2016)