This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Object detection and pose estimation are important computer vision tasks with numerous practical applications, such as self-driving cars, security systems, and robotics. Object detection involves identifying the presence and location of objects of interest within an image or video, while pose estimation is the process of determining the spatial location and orientation of objects in a scene. With object detection, a system can identify and track objects in real time for automated image and video analysis. Pose estimation is critical in various fields, including robotics, gaming, augmented reality, and medicine. For instance, in robotics, pose estimation helps robots understand their position relative to objects in their environment, enabling them to perform tasks such as grasping and manipulating objects accurately. Overall, object detection and pose estimation enable machines to understand and interact with the world around them, supporting further scene understanding and manipulation of the environment. A number of object detection and pose estimation methods rely on RGB information or point cloud sensing alone [1, 2]; however, their performance is limited by the single-modal information. With the rapid development of hybrid sensors, RGB-D cameras, which jointly provide visual (RGB) and depth measurements, offer a more effective basis for both object detection and pose estimation. It is therefore possible to develop a complete framework for object detection and pose estimation using RGB-D cameras, helping robots understand the environment and make decisions.
To estimate the pose of objects, some early RGB-D–based methods [3–6] estimate a relative pose between the objects and a predefined model (i.e., a CAD model defined in its canonical pose). They extract handcrafted 3D or 2D features from the scene to establish correspondences and then generate pose hypotheses. Finally, hypothesis verification is conducted by evaluating the quality of each hypothesis (such as the inlier ratio of the established correspondences) to select the best pose. Since the 3D/2D features are extracted from the whole scene, there exist a number of outliers that do not belong to the object, so the correspondences are usually not reliable enough. Moreover, the quality of the features degenerates when the point clouds are textureless. As a result, learning-based methods such as PoseCNN [7] have been proposed, which attempt to fuse RGB-D information and directly regress the pose of an object through a network. In this line of work, a 2D segmentation network first produces object masks; a convolutional neural network (CNN) [8] and a PointNet [9] then extract pixel-wise 2D features and 3D features in the masked regions, which are subsequently fused by a multilayer perceptron (MLP) block. Finally, every pixel-wise fused feature is used to predict a pose and a confidence score for hypothesis selection.
However, these methods are limited in four aspects. First, several approaches directly extract feature descriptors from the whole scene to establish correspondences; this inevitably introduces interference from background information, making it difficult to obtain precise correspondences. Second, these methods are fragile to the diversity of object poses: the pose of an object falls in the continuous SE(3) space, and learning-based methods require extensive data to learn and identify various poses. Third, some methods directly regress object poses from learned high-dimensional features, which are sensitive to occlusion, resulting in failed pose estimation. Finally, these data-driven methods are sensitive to the data distribution and thus lack generalization capability.
In this paper, we present an experimental study of an object detection and pose estimation pipeline, as shown in Figure 1. First, we utilize an advanced object detection method, YOLO v8 [10], to detect previously trained objects in the RGB images, predicting their bounding boxes, segmentation masks, and category labels. This step eliminates the interference of background information for more accurate pose estimation. Second, given a detected bounding box, we extract the depth information of the corresponding area from the depth map and generate a point cloud using the camera's intrinsic parameters. Finally, we use a scaled alignment method to align the model data with the scanned data to obtain the object's pose. Since alignment algorithms such as SICP [11] and CPD [12] are sensitive to initial poses, we present a multipose initialization scheme to accurately estimate object poses. Because the alignment algorithms are robust to outliers in the point clouds, our method handles occluded data more stably. Moreover, our pose estimation method helps find the best-matching model from the model library, thus achieving a more accurate and complete representation of scanned objects.
[figure(s) omitted; refer to PDF]
Our study also explores the potential of a hybrid sensing technology that integrates RGB and depth sensing for enhanced object detection in a given scene. Unlike methods that predominantly rely on point-cloud data, such as PointNet++ [13], our approach bifurcates the problem into two distinct phases. Initially, we focus on RGB-based object detection to identify and locate objects within the scene. Subsequently, we utilize the CAD point clouds of these detected objects to determine their precise poses. This two-phased approach addresses the limitations encountered in the object detection phase, particularly in accurately establishing the objects’ poses relative to the sensor. This novel methodology aims to bridge the gap in current sensing technologies, offering a more comprehensive and accurate object detection and pose estimation solution.
This kind of technology has numerous applications in industrial settings, such as in automation and robotics, quality control and inspection, human–robot collaboration, and more. In automation and robotics, it can assist robots in accurately identifying and handling objects, enhancing production efficiency. In quality control, it can detect defects in products by comparing the observed pose and orientation of an object with its expected configuration, ensuring product quality. These applications not only improve the efficiency and safety of work processes but also contribute to the development of more intelligent and adaptive industrial environments.
After introducing the topic in the Introduction, Section 2 provides an overview of the relevant literature, while Section 3 describes the methodology used to estimate the poses of objects. In Section 4, we present an experimental study of the presented pose estimation framework. Finally, we conclude the paper with a summary of our main findings and suggestions for future research in Section 5.
2. Related Works
2.1. 2D Object Detection
Object detection is a computer vision task that involves identifying objects within an image or video and localizing them with bounding boxes; it is essential for many applications, including autonomous vehicles, surveillance, and robotics. Object detection methods can be categorized into single-stage, two-stage, and multistage detectors. Single-stage detectors use a single pass through the neural network to detect objects in an image; examples include YOLO [10], SSD [14], and RetinaNet [15]. YOLO is one of the most classic methods and is constantly being improved [16–21]. The key idea behind YOLO is to simultaneously predict bounding boxes and class probabilities for the objects present in an image. This is achieved by dividing the input image into a grid of cells and predicting the bounding box and class probabilities for each cell; the output of the network is a set of bounding boxes and corresponding class probabilities for all detected objects. These one-stage detectors are generally faster and more efficient than two-stage detectors. Two-stage detectors use a two-stage approach: in the first stage, a region proposal network generates candidate object regions, and in the second stage, these regions are classified as object or nonobject. Two-stage detectors generally achieve higher accuracy than single-stage detectors but are slower; examples include Faster R-CNN [22], R-FCN [23], and Mask R-CNN [24]. Moreover, some methods use a multistage pipeline, such as Cascade R-CNN [25] and HTC [26]: a region proposal network first generates coarse object proposals, these proposals are then refined using a bounding box regression network, and finally object classification is performed. Multistage detectors are generally the most accurate but also the slowest.
2.2. Pose Estimation
Object pose estimation [27] is an important task in computer vision that aims to estimate the 3D position and orientation of an object from an input image or point cloud. Over the years, researchers have developed various methods to address this problem, ranging from classical algorithms to deep learning–based models. One popular approach is based on RGB images. Classical methods such as the Perspective-n-Point (PnP) algorithm [28] and feature-based methods have been widely used for this task: PnP assumes a calibrated camera and solves for the object's rotation and translation relative to the camera, while feature-based methods extract feature points from the object and match them with those in the image. In recent years, deep learning–based methods such as PoseNet [29] and PVNet [30] have also shown promising results in estimating object pose from RGB images. The advantage of these methods is that RGB images are easily accessible and can be captured with a standard camera. However, their performance is limited by the fact that RGB images are prone to occlusion, lighting changes, and background clutter; they also struggle to estimate the poses of objects with similar appearances.
Another approach to object pose estimation is based on point clouds. Point cloud registration methods, such as the iterative closest point (ICP) algorithm and random sample consensus (RANSAC) [28], are often used to estimate object poses. More recently, deep learning–based methods such as CATRE [31] have also shown promising results in this area. The pipeline of these methods involves extracting features from point clouds, regressing the object's pose with a deep neural network, and refining the predicted pose with postprocessing techniques. These methods are more robust to occlusion, lighting changes, and background clutter than RGB-based methods.
Recently, some methods based on RGB-D images have been proposed, which can leverage the advantages of two modalities. For example, DenseFusion uses a dense fusion network to fuse the RGB and depth information and estimate the object’s pose. The key idea behind the method is to use a dense correspondence network to establish correspondences between the RGB and depth images, and then use these correspondences to refine the object’s pose estimation. However, the method requires a large amount of annotated data to train the network and can be computationally expensive at inference time.
2.3. Point Cloud Alignment
Point cloud alignment aims to solve rigid transformation between two point clouds, and this task has been widely studied by traditional methods and learning-based methods.
The ICP algorithm [32] is the most classical local registration method; it iteratively estimates correspondences and refines the pose until it converges to a local optimum. However, ICP is sensitive to initial poses, noise, outliers, scale differences, and partial overlap. To address these issues, various variants of ICP have been proposed, such as point-to-plane ICP, trimmed ICP (TrICP) [33], generalized ICP [34], Go-ICP [35], scale-ICP, and AA-ICP [36]. In addition to these variants, probabilistic registration methods such as coherent point drift (CPD) [12] model the registration process in a probabilistic framework: CPD estimates the probability distribution of the correspondences and uses the expectation-maximization (EM) algorithm to optimize the registration. Such probabilistic methods have been shown to handle noise and outliers effectively and to achieve robust registration results.
Recently, deep learning–based methods have been proposed to overcome the limitations of traditional ICP. The Scan2CAD [37] alignment method uses a 3D CNN to learn a shared embedding between 3D scans and CAD models; however, the alignment process is time-consuming due to exhaustive CAD model retrieval, and color information is omitted. The SceneCAD [38] method presents a graph neural network (GNN) to perform relational inference on 3D objects and the scene layout. It uses 3D-SIS [39] for object detection, classification, and CAD model retrieval; given the obtained correspondences, it uses a Procrustes [40] method for transformation estimation. Compared to the earlier methods, SceneCAD improves alignment accuracy and efficiency by incorporating object-layout relational inference.
3. Methods
In this paper, our goal is to experimentally study the performance of estimating object poses by aligning virtual point clouds of CAD models to the detected objects. Given an image pair (I, D) scanned by an RGB-D camera, where
3.1. Object Detection and Segmentation on RGB Images
In our pose estimation pipeline, we first detect the locations of objects in the captured RGB image I. Object detection is a critical task in the computer vision field and has been widely studied. Here, we use deep learning–based object detection algorithms such as Mask R-CNN or YOLO v8, which automatically extract features from the image and predict the bounding box and class label of each object. More importantly, they can also predict instance segmentation masks, which are used to filter out background information within the bounding boxes and enable more accurate pose estimation in the next step. Besides, we can leverage the label information to find CAD models of the same category.
3.2. Point Cloud Generation From Depth Images
Given the bounding boxes and instance segmentation masks of the objects, we can generate the corresponding point clouds using the depth image and the known intrinsic parameters of the camera [41]. For each pixel with a valid depth value, the corresponding point can be generated by
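As a concrete illustration, the back-projection step can be sketched in a few lines of Python. Since the equation itself is omitted from this version, the standard pinhole model is assumed; the function names and the intrinsic values used in the example are purely illustrative.

```python
def backproject(u, v, z, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth z into a camera-frame 3D point
    using the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)

def depth_to_cloud(depth, mask, fx, fy, cx, cy):
    """Convert a depth map (list of rows) to a point list, keeping only
    masked pixels with a valid (nonzero) depth value."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if mask[v][u] and z > 0:  # skip background and depth shadows
                points.append(backproject(u, v, z, fx, fy, cx, cy))
    return points
```

Applying the instance mask here (rather than the full bounding box) is what removes most of the background points before alignment.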
3.3. CAD Model Retrieval and Alignment
After detecting objects in the scene and segmenting their corresponding point clouds, the objective is to align the sensed point clouds to CAD models fixed in the canonical pose, enabling pose estimation of the objects. In this section, we first give an overview of several point cloud alignment algorithms, which are in general very sensitive to the initial poses between scanned objects and CAD models, often leading to failed alignments. We then present a multipose initialization scheme to overcome such misalignments. Finally, we experimentally study these algorithms for aligning scanned point clouds of objects to CAD models and present an evaluation criterion that measures the quality of different alignment results and is used to select the best-aligned CAD model.
3.3.1. Scalable Point Cloud Alignment Algorithms
Given two point clouds
ICP algorithm: The basic idea behind ICP is to iteratively find the best transformation between two point clouds by minimizing the distance between corresponding points [32]. Each iteration consists of two basic steps: correspondence estimation and transformation optimization. More specifically, given two point clouds P and M, nearest neighbors are searched between the two point clouds to establish correspondences. For a point
Based on current correspondences, the second step is to compute the optimal homogeneous transformation matrix,
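The two steps of each ICP iteration can be sketched as follows. This is a minimal brute-force version for illustration only (practical implementations accelerate the nearest-neighbor search with k-d trees); the transformation step uses the standard closed-form SVD (Kabsch) solution, which is one common way to solve the least-squares problem above.

```python
import numpy as np

def kabsch(P, Q):
    """Closed-form least-squares rigid transform (R, t) mapping P onto Q,
    given row-wise correspondences, via SVD of the cross-covariance."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # correct an improper (reflection) solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cq - R @ cp

def icp(P, M, iters=20):
    """Plain point-to-point ICP: alternate nearest-neighbor correspondence
    estimation and closed-form transform optimization."""
    R, t = np.eye(3), np.zeros(3)
    for _ in range(iters):
        Pt = P @ R.T + t
        # brute-force nearest neighbor in M for every transformed point
        nn = np.argmin(((Pt[:, None, :] - M[None, :, :]) ** 2).sum(-1), axis=1)
        R, t = kabsch(P, M[nn])
    return R, t
```

With a good initial pose the loop converges in a few iterations; with a poor one it stalls in a local optimum, which is exactly the failure mode the multipose initialization scheme later addresses.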
SICP algorithm: The SICP algorithm is an extension of the traditional ICP algorithm that addresses the problem of scale differences between two point clouds [11]. Unlike the traditional ICP algorithm, SICP introduces a scale factor as an optimization variable, a positive number that scales the point clouds. Moreover, it adds a constraint to limit the range of scale factor values, which makes the algorithm more stable and prevents it from converging to local optima.
Compared to the traditional ICP algorithm, the biggest difference of the SICP algorithm lies in the form of the optimization objective function. For point pairs (
Overall, the SICP algorithm has several advantages over traditional ICP, such as the ability to address scale differences, more accurate point cloud registration, and improved robustness against local optima. However, it also has drawbacks, such as higher computational complexity and lower robustness to noisy data than the traditional ICP algorithm.
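Given a fixed set of correspondences, the scale-aware update that SICP repeats each iteration has a closed-form solution; a sketch of that inner step is shown below, using Umeyama's similarity-transform solution as a stand-in (the scale-range constraint mentioned above is omitted for brevity, and the exact update in [11] may differ in detail).

```python
import numpy as np

def similarity_transform(P, Q):
    """Closed-form (s, R, t) minimizing sum ||s * R @ p_i + t - q_i||^2
    for row-wise corresponding points (Umeyama's solution)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - cp, Q - cq
    H = Qc.T @ Pc / len(P)            # cross-covariance (target x source)
    U, D, Vt = np.linalg.svd(H)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                  # reflection correction
    R = U @ S @ Vt
    var_p = (Pc ** 2).sum() / len(P)  # variance of the source cloud
    s = np.trace(np.diag(D) @ S) / var_p   # optimal scale factor
    t = cq - s * R @ cp
    return s, R, t
```

Alternating this step with nearest-neighbor correspondence estimation yields a scale-adaptive ICP loop.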
CPD: CPD is another algorithm for point cloud alignment, which can also handle scale differences between two point clouds [12]. Unlike the ICP variants, which estimate hard correspondences among points, the CPD algorithm represents the model point cloud as a Gaussian mixture model (GMM), where the points in the model point cloud are treated as centroids of GMM components with equal isotropic covariances
To be more specific, the model point cloud is regarded as a GMM, and then we can compute the probability of each source point
After obtaining the soft correspondences
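The E-step that produces these soft correspondences can be sketched as follows; a minimal version assuming the standard CPD posterior with a uniform outlier component, where the outlier weight `w` is an assumed default rather than a value taken from the paper.

```python
import numpy as np

def cpd_estep(X, Y, sigma2, w=0.1):
    """CPD E-step: posterior probability P[m, n] that GMM centroid y_m
    (model point) generated scene point x_n, with outlier weight w."""
    M, N, D = len(Y), len(X), X.shape[1]
    # squared distances between every centroid and every scene point, (M, N)
    d2 = ((X[None, :, :] - Y[:, None, :]) ** 2).sum(-1)
    num = np.exp(-d2 / (2.0 * sigma2))
    # constant accounting for the uniform outlier component
    c = (2 * np.pi * sigma2) ** (D / 2) * w / (1 - w) * M / N
    return num / (num.sum(axis=0, keepdims=True) + c)
```

Because every column sums to less than one, points far from all centroids receive low total probability and are effectively down-weighted as outliers, which is why CPD copes well with the noisy, partial scans discussed later.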
3.3.2. Multipose Initialization Scheme
Predicting the poses of objects from point clouds is challenging due to the significant variation in the poses of detected objects; as a result, the difference between the poses of the scanned objects and the CAD models may be substantial. While algorithms such as SICP and CPD are effective at aligning two point clouds, they can fail when the initial poses of the two point clouds differ significantly. To address this issue, we propose a multipose initialization scheme that improves the robustness and accuracy of the alignment. As shown in Figure 2, we consider multiple possible poses (24 different initial poses in total) for the scanned objects. We transform the scanned point clouds with each initial pose
[figure(s) omitted; refer to PDF]
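One simple way to enumerate 24 axis-aligned initial poses is to generate all signed permutation matrices with determinant +1, i.e., the rotation group of a cube. The sketch below assumes this is how the 24 poses are constructed, which the paper does not state explicitly.

```python
from itertools import permutations, product

def cube_rotations():
    """Enumerate the 24 proper rotations that map the coordinate axes onto
    themselves: signed permutation matrices with determinant +1."""
    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
              - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
              + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    rots = []
    for perm in permutations(range(3)):          # which axis maps where
        for signs in product((1, -1), repeat=3):  # axis flips
            m = [[0, 0, 0] for _ in range(3)]
            for row in range(3):
                m[row][perm[row]] = signs[row]
            if det3(m) == 1:  # keep proper rotations, discard reflections
                rots.append(m)
    return rots
```

Each scanned cloud is rotated by every matrix in turn, aligned, and scored, so at least one candidate starts close enough to the CAD model's canonical pose for the local alignment to converge.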
3.3.3. CAD Model Retrieval
By using the aforementioned method, the scanned point cloud and a model point cloud can be matched if they are similar. However, without prior knowledge, we do not know whether they are similar. Therefore, it is necessary to retrieve the model that is most similar to the scanned point cloud and then estimate their relative pose. To achieve this, we align the scanned point cloud with each model point cloud and measure their alignment fitness. Here, we define the fitness using the inlier rate:
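A minimal sketch of such an inlier-rate fitness, computed after alignment: the fraction of scan points whose nearest model point lies within a distance threshold. Since the formula and threshold value are omitted from this version, `tau` is an assumed illustrative parameter.

```python
import numpy as np

def fitness(scan, model, tau=0.01):
    """Inlier rate: fraction of aligned scan points whose nearest model
    point is within distance tau (brute-force nearest neighbor)."""
    d2 = ((scan[:, None, :] - model[None, :, :]) ** 2).sum(-1)
    return float((d2.min(axis=1) <= tau ** 2).mean())
```

Running this score for every CAD model in the library and keeping the maximum implements the retrieval step: the best-fitting model and its estimated pose are returned together.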
4. Experimental Study
We evaluate the performance of our pose estimation framework. Considering robotics perception and grasping tasks, we mainly experiment with axisymmetric objects such as bottles, together with an example of asymmetric objects, cups. In this section, we first introduce the dataset used, including the virtual point clouds of objects obtained from a CAD model library and the measured point clouds of sensed objects obtained with RGB-D sensing. We then show experimental results for object detection and segmentation. Finally, we demonstrate the effectiveness of our pose estimation pipeline, compare the performance of three alignment methods, and discuss their advantages and shortcomings.
4.1. Dataset
To obtain the CAD models of bottles and cups, we used the ModelNet40 [42] dataset, which contains 435 CAD models of bottles and 99 CAD models of cups. We randomly selected 15 CAD models for each category to build the CAD library; some examples are shown in Figure 3. To obtain the scanned objects, we placed some cups and bottles with various shapes on a desk. Then, we used an Intel RealSense D435i RGB-D sensor (as shown in Figure 4) to simultaneously capture color and depth information of these objects.
[figure(s) omitted; refer to PDF]
4.2. 2D Object Detection and 3D Point Cloud Generation
To detect the objects in the scene, we run YOLO v8 on the captured RGB images. Figure 5 shows the detection results for bottles and a cup, from which we can see that YOLO successfully predicts both bounding boxes and segmentation masks. The mask is better suited than the bounding box for generating the object's point cloud because it contains very little background information. Unfortunately, the depth images have shadow regions where the depth values are 0 (the blue regions in the second column), which may be caused by the illumination. Moreover, the generated point clouds contain some outliers, and some regions are distorted; these issues challenge the subsequent alignment step.
[figure(s) omitted; refer to PDF]
4.3. Scan-to-Model Point Cloud Alignment
Here, we present the experimental results of scan-to-model alignment, that is, aligning the scanned point clouds to the virtual point clouds of CAD models. Objects with different shapes usually have different scales, leading to scale differences between CAD models and scanned objects. We first compare the performance of three alignment algorithms, ICP, SICP, and CPD, in addressing this problem (Section 4.3.1). Second, we report the alignment results under different initial poses to demonstrate the effectiveness of our multipose initialization scheme, and discuss the performance of the SICP and CPD algorithms under different initial poses (Section 4.3.2). Third, we present experimental results with different models to show that our method can retrieve the best-aligned CAD model (Section 4.3.3). Finally, we report alignment results with different objects, demonstrating the generalization ability of our pose estimation pipeline (Section 4.3.4).
4.3.1. Alignment With Scale Differences
We evaluated the performance of different algorithms, including ICP, SICP, and CPD, in aligning the scanned point clouds to virtual point clouds sampled from CAD models. We first obtained the point cloud of an actual object (i.e., a cup) using the RGB-D sensor and the aforementioned object detection method. Then, we inspected all cups in the library and selected the most similar one, CAD Model 6, to test the performance of the different alignment algorithms. The experimental results are shown in Figure 6, from which we can observe that the ICP algorithm fails to align the scan data (red) to the model data (blue), whereas both SICP and CPD succeed because they adaptively estimate the scale factor.
[figure(s) omitted; refer to PDF]
To further study the ability of the ICP algorithm, we manually set several scale values (0.2, 0.1, and 0.05) and used them to rescale the model data before alignment. The visualized alignment results are shown in Figure 7. From the figure, we can observe that the scale factor has a large impact on the alignment results, and the rotations and translations estimated at different scales vary considerably. It can also be seen that accurate point cloud alignment is difficult to achieve by manual scale setting, and this approach is inefficient. Therefore, in the following experiments we mainly compare the CPD and SICP algorithms, which adaptively estimate the scale difference.
[figure(s) omitted; refer to PDF]
4.3.2. Alignment Results Under Different Initial Poses
Since the alignment methods (including the SICP and CPD algorithms) are sensitive to the initial pose between the object and the models, we present a multipose initialization scheme to overcome this problem. To demonstrate its effectiveness, we carefully selected a model guaranteed to match the scanned object and then conducted the alignment experiments. Table 1 reports the overlap scores of the different initial poses. Moreover, we visualize the Top 3 alignment results with different initial poses in Figure 8.
Table 1
Overlap scores for the 24 initial poses described in Section 3; the Top 3 initial poses for CPD and SICP are highlighted.
| InitPose | CPD | SICP | InitPose | CPD | SICP | InitPose | CPD | SICP |
| 1 | 0.886 | 0.615 | 9 | 0.427 | 0.547 | 17 | 0.408 | 0.551 |
| 2 | 0.501 | 0.396 | 10 | 0.886 | 0.900 | 18 | 0.399 | 0.405 |
| 3 | 0.427 | 0.492 | 11 | 0.814 | 0.718 | 19 | 0.432 | 0.463 |
| 4 | 0.430 | 0.593 | 12 | 0.500 | 0.599 | 20 | 0.244 | 0.531 |
| 5 | 0.413 | 0.532 | 13 | 0.402 | 0.558 | 21 | 0.440 | 0.684 |
| 6 | 0.372 | 0.292 | 14 | 0.442 | 0.644 | 22 | 0.381 | 0.378 |
| 7 | 0.886 | 0.646 | 15 | 0.413 | 0.795 | 23 | 0.402 | 0.547 |
| 8 | 0.502 | 0.551 | 16 | 0.381 | 0.608 | 24 | 0.385 | 0.595 |
[figure(s) omitted; refer to PDF]
From Table 1 and Figure 8, we can observe that the results of different initial poses vary significantly. Thanks to our multipose initialization scheme, a good alignment can still be found, demonstrating the scheme's effectiveness in achieving robust alignment. Moreover, the figure shows that the alignment results with the highest overlap scores are indeed good alignments, which demonstrates the validity of our proposed overlap score measurement.
By comparing the performance of the SICP and CPD algorithms, we find that the CPD algorithm is more robust to initial poses: CPD successfully aligns the model to the scan under three different initial poses, while SICP succeeds under only one. Regarding registration precision, the highest overlap score obtained by CPD under different initial poses is 0.886, while the highest score for SICP is 0.900; the registration precision of SICP is therefore higher, and its error smaller, than that of CPD. To further explore whether CPD has a wider convergence range than SICP, we performed a comparison experiment. Specifically, we first aligned the scan data to the model object and then rotated the model by different angles about different axes. The SICP and CPD algorithms were used to align the two point clouds, and the alignment results are shown in Figure 9. We can see that CPD has a wider convergence range than SICP. Furthermore, SICP often achieves higher overlap scores under better initial poses, which again indicates that SICP obtains smaller registration errors than CPD.
[figure(s) omitted; refer to PDF]
Furthermore, we conducted an experiment to compare their efficiency; the results are shown in Figure 10. We can observe that SICP is much faster than CPD.
[figure(s) omitted; refer to PDF]
4.3.3. Alignment Results With Different Models
When aligning the scanned point cloud with a CAD model, a problem arises: scanned objects have various shapes, and we do not know in advance which CAD model should be used, since the object detection network outputs no information about the shape of the objects. To address this, we align the scanned data with all models of the same category in the model library and then select the one with the largest overlap score. Here, we demonstrate that the most similar CAD model can indeed be found in this way. Note that, to obtain the best alignment result for each model, we used the multipose initialization scheme.
The alignment results between the object and the different models are shown in Table 2. The overlap scores vary significantly across models, indicating the diversity of model shapes. To show the alignment results more intuitively, we present the Top 3 candidate models in Figure 11. We can observe that the models with the highest overlap scores indeed achieve successful alignments, demonstrating the validity of our evaluation criterion. Moreover, both the SICP and CPD algorithms show a strong ability to handle scale alignment.
Table 2
Overlap scores of registration results of different models.
| ID | CPD | SICP | ID | CPD | SICP | ID | CPD | SICP |
| 1 | 0.938 | 0.942 | 6 | 0.886 | 0.900 | 11 | 0.408 | 0.551 |
| 2 | 0.457 | 0.452 | 7 | 0.778 | 0.756 | 12 | 0.399 | 0.405 |
| 3 | 0.795 | 0.799 | 8 | 0.716 | 0.755 | 13 | 0.432 | 0.463 |
| 4 | 0.775 | 0.853 | 9 | 0.931 | 0.943 | 14 | 0.244 | 0.531 |
| 5 | 0.916 | 0.934 | 10 | 0.851 | 0.849 | 15 | 0.440 | 0.684 |
Note: The Top 3 model candidates for CPD and SICP are highlighted among the 15 models mentioned in Section 3. Here, Model 6 is the example model used to discuss the initial pose in Section 4.3.2.
[figure(s) omitted; refer to PDF]
4.3.4. Alignment Results With Occluded Objects and Scene Reconstruction
To show the generalization ability of our pipeline, we present alignment results for different objects with various shapes in Figure 12. As can be seen, the generated point clouds are severely incomplete due to occlusion and the single-camera view, but our pose estimation pipeline can handle this challenging case. The scaled alignment algorithm matches the occluded cup to the CAD model precisely. For the occluded bottle, its length is ambiguous; therefore, a CAD model that best matches the visible part of the bottle is selected and aligned. Finally, an accurate virtual reconstruction is achieved using clean, complete CAD models (as shown in Figure 12(c)).
[figure(s) omitted; refer to PDF]
4.4. Limitations
This study encountered several limitations: (1) Inconsistency in point cloud density. A primary issue arises from the disparity in density between the scanned point clouds and the model point clouds. This inconsistency reduces matching accuracy, particularly when the scanned point clouds are of low density; in such cases, the lack of detailed structural information hampers precise alignment with the model point clouds. (2) Efficiency of the current pipeline. Our current pipeline relies on registration-based retrieval, in which scanned point clouds are individually aligned with model point clouds to identify the most similar model. This approach is inefficient, especially as the size of the model library increases, leading to prolonged processing times.
5. Conclusions and Future Work
In this paper, we proposed an object pose estimation framework based on scaled alignment approaches and conducted experimental studies of two such approaches, i.e., SICP and CPD. In the framework, we first use an advanced object detection method, YOLO v8, to locate the objects in a scene; the detection network outputs bounding boxes, instance segmentation masks, and predicted class labels. The point clouds of the detected objects are then generated using the depth images, segmentation masks, and camera intrinsic parameters. After obtaining the object point clouds, we align them to CAD models for pose estimation. Our experimental studies comparing SICP and CPD showed that both are sensitive to the relative pose between the scanned point clouds and the CAD models. To address this issue, we presented a multipose initialization scheme, which rotates the scanned point clouds by 24 different transformations and then aligns each rotated point cloud to the CAD models; a designed evaluation criterion selects the best-matched result from these 24 candidates. The scanned object can be iteratively aligned to each CAD model in the same category, so the most similar CAD model can be retrieved and the relative pose estimated.
To address the limitations discussed in Section 4.4, future research directions include the following: (1) Multidensity registration methods. To handle inconsistent densities, further research into multidensity registration methods is warranted. This would involve hierarchical downsampling of model point clouds to match the density of scanned clouds, facilitating more reliable registration. In addition, for scanned point clouds with low density, exploring dense depth estimation methods based on images and sparse point clouds could generate denser point clouds, thereby enhancing registration accuracy. (2) Efficient model retrieval. To improve the efficiency of model retrieval, future efforts could focus on designing methods for global structural feature extraction and matching. By extracting global feature descriptors from point clouds, more efficient retrieval processes can be developed, significantly reducing the time required for model identification.
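The hierarchical downsampling idea in (1) could, for example, be realized with voxel-grid downsampling at growing voxel sizes until the model cloud roughly matches the scan's point count. The helpers below (`voxel_downsample`, `downsample_to_count`) and their parameters are hypothetical sketches of this direction, not an implemented method.

```python
import numpy as np

def voxel_downsample(points, voxel_size):
    """Voxel-grid downsampling: snap points to a grid and keep one
    centroid per occupied voxel, reducing the cloud's density."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    n_voxels = int(inverse.max()) + 1
    sums = np.zeros((n_voxels, 3))
    counts = np.zeros(n_voxels)
    np.add.at(sums, inverse, points)    # accumulate points per voxel
    np.add.at(counts, inverse, 1.0)
    return sums / counts[:, None]       # one centroid per voxel

def downsample_to_count(points, target, voxel0=0.01, growth=1.5, max_iter=20):
    """Grow the voxel size until the model cloud has at most `target`
    points, roughly matching the scanned cloud's density."""
    voxel, out = voxel0, points
    for _ in range(max_iter):
        out = voxel_downsample(points, voxel)
        if len(out) <= target:
            break
        voxel *= growth
    return out
```

A hierarchy of such downsampled copies (one per voxel size) could be precomputed per CAD model, so registration always runs against a model whose density matches the incoming scan.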
Funding
This research was funded by the Discovery Grant through the Natural Sciences and Engineering Research Council of Canada and Simon Fraser University.
References
[1] C. Gümeli, A. Dai, M. Nießner, "ROCA: Robust CAD Model Retrieval and Alignment From a Single Image," 2021. https://arxiv.org/abs/2111.01268
[2] W. Kuo, A. Angelova, T.-Y. Lin, A. Dai, "Patch2CAD: Patchwise Embedding Learning for In-the-Wild Shape Retrieval From a Single Image," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12569-12579, 2021.
[3] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, C. Rother, "Learning 6D Object Pose Estimation Using 3D Object Coordinates," Lecture Notes in Computer Science, pp. 536-551, DOI: 10.1007/978-3-319-10605-2_35, 2014.
[4] W. Kehl, F. Milletarì, F. Tombari, S. Ilic, N. Navab, "Deep Learning of Local Rgb-D Patches for 3d Object Detection and 6d Pose Estimation," Lecture Notes in Computer Science, pp. 205-220, DOI: 10.1007/978-3-319-46487-9_13, 2016.
[5] C. Wang, D. Xu, Y. Zhu, "DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3338-3347, 2019.
[6] P. Wohlhart, V. Lepetit, "Learning Descriptors for Object Recognition and 3D Pose Estimation," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3109-3118, DOI: 10.1109/cvpr.2015.7298930, 2015.
[7] Y. Xiang, T. Schmidt, V. Narayanan, D. Fox, "PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes," 2017. https://arxiv.org/abs/1711.00199
[8] Y. LeCun, Y. Bengio, "Convolutional Networks for Images, Speech, and Time Series," 1998.
[9] C. Qi, H. Su, K. Mo, L. J. Guibas, "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 77-85, 2017.
[10] G. Jocher, A. Chaurasia, J. Qiu, "YOLO by Ultralytics, Version 8.0.0," 2023. https://github.com/ultralytics/ultralytics
[11] S. Du, N. Zheng, S. Ying, J. Wei, "Icp With Bounded Scale for Registration of M-D Point Sets," Multimedia and Expo, 2007 IEEE International Conference, pp. 1291-1294, DOI: 10.1109/icme.2007.4284894, 2007.
[12] A. Myronenko, X. B. Song, "Point Set Registration: Coherent Point Drift," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32 no. 12, pp. 2262-2275, DOI: 10.1109/TPAMI.2010.46, 2010.
[13] C. Qi, L. Yi, H. Su, L. J. Guibas, "Pointnet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space," Neural Information Processing Systems, 2017.
[14] W. Liu, D. Anguelov, D. Erhan, "SSD: Single Shot MultiBox Detector," European Conference on Computer Vision, 2016.
[15] X. Cheng, J. Yu, "RetinaNet With Difference Channel Attention and Adaptively Spatial Feature Fusion for Steel Surface Defect Detection," IEEE Transactions on Instrumentation and Measurement, vol. 70, DOI: 10.1109/tim.2020.3040485, 2021.
[16] J. Redmon, S. K. Divvala, R. B. Girshick, A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779-788, DOI: 10.1109/cvpr.2016.91, 2016.
[17] J. Redmon, A. Farhadi, "Yolo9000: Better, Faster, Stronger," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517-6525, DOI: 10.1109/cvpr.2017.690, 2017.
[18] J. Redmon, A. Farhadi, "Yolov3: An Incremental Improvement," 2018. https://arxiv.org/abs/1804.02767
[19] G. Jocher, "YOLOv5 by Ultralytics, Version 7.0," 2020. https://github.com/ultralytics/yolov5
[20] C. Li, L. Li, H. Jiang, "Yolov6: A Single-Stage Object Detection Framework for Industrial Applications," 2022. https://arxiv.org/abs/2209.02976
[21] C.-Y. Wang, A. Bochkovskiy, H.-Y. M. Liao, "Yolov7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors," 2022. https://arxiv.org/abs/2207.02696
[22] S. Ren, K. He, R. B. Girshick, J. Sun, "Faster R-Cnn: Towards Real-Time Object Detection With Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39 no. 6, pp. 1137-1149, DOI: 10.1109/tpami.2016.2577031, 2017.
[23] J. Dai, Y. Li, K. He, J. Sun, "R-Fcn: Object Detection via Region-Based Fully Convolutional Networks," Advances in Neural Information Processing Systems, vol. 29, 2016.
[24] K. He, G. Gkioxari, P. Dollár, R. B. Girshick, "Mask R-Cnn," IEEE International Conference on Computer Vision (ICCV), pp. 2980-2988, 2017.
[25] Z. Cai, N. Vasconcelos, "Cascade R-Cnn: High Quality Object Detection and Instance Segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43 no. 5, pp. 1483-1498, DOI: 10.1109/tpami.2019.2956516, 2021.
[26] K. Chen, J. Pang, J. Wang, "Hybrid Task Cascade for Instance Segmentation," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4969-4978, 2019.
[27] É. Marchand, H. Uchiyama, F. Spindler, "Pose Estimation for Augmented Reality: A Hands-on Survey," IEEE Transactions on Visualization and Computer Graphics, vol. 22 no. 12, pp. 2633-2651, DOI: 10.1109/tvcg.2015.2513408, 2016.
[28] M. A. Fischler, R. C. Bolles, "Random Sample Consensus: A Paradigm for Model Fitting With Applications to Image Analysis and Automated Cartography," Communications of the ACM, vol. 24 no. 6, pp. 381-395, DOI: 10.1145/358669.358692, 1981.
[29] A. Kendall, M. K. Grimes, R. Cipolla, "Posenet: A Convolutional Network for Real-Time 6-Dof Camera Relocalization," 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2938-2946, DOI: 10.1109/iccv.2015.336, 2015.
[30] S. Peng, X. Zhou, Y. Liu, H. Lin, Q.-X. Huang, H. Bao, "Pvnet: Pixel-Wise Voting Network For 6dof Object Pose Estimation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44 no. 6, pp. 3212-3223, DOI: 10.1109/tpami.2020.3047388, 2022.
[31] X. Liu, G. Wang, Y. Li, X. Ji, "Catre: Iterative Point Clouds Alignment for Category-Level Object Pose Refinement," Lecture Notes in Computer Science, pp. 499-516, DOI: 10.1007/978-3-031-20086-1_29, 2022.
[32] P. J. Besl, N. D. McKay, "A Method for Registration of 3d Shapes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14 no. 2, pp. 239-256, DOI: 10.1109/34.121791, 1992.
[33] D. Chetverikov, D. Svirko, D. Stepanov, P. Krsek, "The Trimmed Iterative Closest Point Algorithm," Object recognition supported by user interaction for service robots, vol. 3, pp. 545-548, DOI: 10.1109/icpr.2002.1047997, 2002.
[34] A. V. Segal, D. Hähnel, S. Thrun, "Generalized-Icp," Robotics: Science and Systems, 2009.
[35] J. Yang, H. Li, D. Campbell, Y. Jia, "Go-Icp: A Globally Optimal Solution to 3d Icp Point-Set Registration," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38 no. 11, pp. 2241-2254, DOI: 10.1109/tpami.2015.2513405, 2016.
[36] A. L. Pavlov, G. V. Ovchinnikov, D. Y. Derbyshev, D. Tsetserukou, I. Oseledets, "Aa-Icp: Iterative Closest Point With Anderson Acceleration," IEEE International Conference on Robotics and Automation (ICRA), 2018.
[37] A. Avetisyan, M. Dahnert, A. Dai, M. Savva, A. X. Chang, M. Nießner, "Scan2CAD: Learning CAD Model Alignment in RGB-D Scans," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2609-2618, DOI: 10.1109/cvpr.2019.00272, 2019.
[38] A. Avetisyan, T. Khanova, C. B. Choy, D. Dash, A. Dai, M. Nießner, "Scenecad: Predicting Object Alignments and Layouts in Rgb-D Scans," Lecture Notes in Computer Science, pp. 596-612, DOI: 10.1007/978-3-030-58542-6_36, 2020.
[39] J. Hou, A. Dai, M. Nießner, "3D-SIS: 3D Semantic Instance Segmentation of RGB-D Scans," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4416-4425, DOI: 10.1109/cvpr.2019.00455, 2019.
[40] C. R. Goodall, "Procrustes Methods in the Statistical Analysis of Shape," Journal of the Royal Statistical Society-Series B: Statistical Methodology, vol. 53 no. 2, pp. 285-321, DOI: 10.1111/j.2517-6161.1991.tb01825.x, 1991.
[41] R. I. Hartley, A. Zisserman, "Multiple View Geometry in Computer Vision," 2004.
[42] Z. Wu, S. Song, A. Khosla, "3D ShapeNets: A Deep Representation for Volumetric Shapes," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1912-1920, 2015.
Copyright © 2025 Yiyang Dong et al. Journal of Engineering published by John Wiley & Sons Ltd. This is an open access article under the terms of the Creative Commons Attribution License (the “License”), which permits use, distribution and reproduction in any medium, provided the original work is properly cited. https://creativecommons.org/licenses/by/4.0/
Abstract
Pose estimation of objects is one of the main tasks for robots to understand their environments and to carry out grasping and manipulation tasks. In this paper, we present an experimental study of CAD-based object pose estimation to detect object locations and estimate their orientations using predefined models. Specifically, our study pipeline is developed for RGB-D sensors and consists of three steps. First, we incorporate an object detection method using RGB images, which outputs bounding boxes, instance masks, and class labels of detected objects, but no pose information. Then, we leverage the depth values of the masked pixels and the known camera intrinsics to generate point clouds of the objects. Finally, we align CAD models, defined in canonical poses, to the scanned objects, achieving pose estimation and a complete representation of the objects. Given that such an alignment task faces many challenges, such as scale differences, partial overlap, noise, and outliers, we introduce two alignment approaches, namely, scale iterative closest point (SICP) and coherent point drift (CPD), and present a comprehensive experimental study of their accuracy, robustness, and computational efficiency. In particular, we observe that both methods are sensitive to the initial relative poses of objects. To address this problem, we introduce a multipose initialization scheme to improve their robustness. Our experimental results show that both methods can achieve accurate alignment; however, SICP is more time-efficient, while CPD is more robust to noise and occlusions. Our study demonstrates the feasibility of using RGB-D sensors, an object detection module, and point cloud alignment methods for accurate object detection and pose estimation.