1. Introduction
Understanding 3D scenes is a fundamental necessity for various applications, including autonomous driving [1], robotics [2], medical image processing [3], and 3D reconstruction [4]. By enabling computers to achieve a deeper understanding of 3D scenes, 3D instance segmentation drives the advancement and application of intelligent systems across diverse domains. However, repetitive structures in indoor scenes (e.g., tables, chairs, and beds) present significant challenges for 3D instance segmentation. The similarity in geometric shapes and feature distributions of these objects often leads to instance confusion and crowding in the feature space, increasing the likelihood of missed detections and false positives. Moreover, the close arrangement of objects in indoor scenes, coupled with complex occlusions, makes it difficult to accurately segment complete instances, particularly when edges are blurred or objects overlap. The limited geometric and topological variability of repetitive structures further exacerbates segmentation difficulties, limiting the model's ability to distinguish objects of similar shape. Additionally, inconsistencies in data annotation introduce noise into the training data, potentially biasing the learning process and affecting generalization performance. When encountering unseen repetitive structures in the test set or a new environment, the model often exhibits poor adaptability.
In current research on 3D instance segmentation, deep learning methods are predominant, typically involving the training of neural networks for scene segmentation. Among these methods, backbone networks such as MLPs are commonly used for feature extraction. However, MLPs, while simple and easy to implement, often struggle to capture spatial information effectively during feature extraction and incur high computational complexity. As a result, networks relying solely on MLPs may not perform satisfactorily in the segmentation of complex scenes.
Researchers have recently proposed various methods to overcome the limitations of traditional multilayer perceptron (MLP) architectures for 3D instance segmentation, leading to growing interest in innovative approaches. Rozenberszki et al. [5] introduced a novel unsupervised method called UnScene3D, which extracts color and geometric features from RGB-D data to generate pseudo-masks without relying on labeled data; self-training is then employed to refine these masks, improving segmentation performance. Traditional bottom-up strategies often involve post-processing steps, which can be computationally expensive and require domain-specific assumptions. Kolodiazhnyi et al. [6] proposed a top-down approach called TD3D, which is entirely data-driven: it directly predicts instance proposals, filters them using non-maximum suppression, and performs mask segmentation for each proposal. The versatility of deep learning models for segmentation tasks has also been studied; while models can excel at specific tasks, they often struggle to generalize to others due to task-specific differences. Kolodiazhnyi et al. [7] proposed OneFormer3D, a unified framework for multitask 3D segmentation that addresses this challenge through novel query selection and matching strategies. The Transformer architecture, which has shown remarkable success in 2D image processing, has also been extended to 3D point clouds: Mask3D [8], a pioneering Transformer-based approach, directly predicts instance masks from 3D point clouds, eliminating the need for post-processing and improving computational efficiency.
A typical method for 3D instance segmentation, known as 3D-BoNet [9], adopts an anchor-free methodology and employs PointNet++ [10] as its backbone network. Leveraging both local and global features, 3D-BoNet directly predicts bounding boxes via its bounding box regression branch and establishes associations between ground truth and predictions using association layers. Furthermore, 3D-BoNet improves segmentation performance by predicting per-point instance labels with its mask prediction branch. However, 3D-BoNet struggles to capture complex features and neglects the spatial relationships between points.
To address the limitation of 3D-BoNet, we propose a novel 3D instance segmentation framework tailored for indoor scenes. Building upon [9], our framework enhances the backbone network by replacing PointNet++ with the offset-attention mechanism [11] and submanifold sparse convolution [12].
By doing so, the network becomes adept at capturing intricate spatial relationships among points, thus improving its ability to perceive global information. Moreover, TSPconv-Net excels at extracting finer local geometric structures within complex scenes. Notably, our framework eliminates the need for post-processing steps such as clustering, non-maximum suppression [13], or voting. As a result, TSPconv-Net exhibits outstanding efficiency in both training and inference phases.
We conduct experiments on the S3DIS dataset [14], and experimental results indicate that our method can achieve outstanding segmentation performance for repetitive structural objects within complex scenes.
In summary, the contributions of our work are as follows:
We propose TSPconv-Net, a novel framework for 3D instance segmentation that leverages the OA mechanism and SSC to extract features in point clouds.
We optimize and improve the backbone network of 3D-BoNet, enhancing the ability of TSPconv-Net to capture spatial relationships within point clouds.
TSPconv-Net achieves outstanding performance on S3DIS without comprehensive modifications of the model architecture or hyperparameter tuning tailored for each dataset.
2. Related Work
Three-dimensional instance segmentation assigns a semantic category label and a unique instance label to each point in a 3D point cloud. It differs from 3D semantic segmentation in that it not only requires considering the semantic information of points but also necessitates distinguishing different instances that share the same semantic label. Driven by applications such as medical image processing and 3D reconstruction, 3D instance segmentation has gained significant interest [15]. Recent approaches can be broadly categorized into proposal-based and proposal-free methods.
Proposal-based methods. These methods adopt a two-stage approach: 3D object detection followed by instance label prediction. This strategy offers benefits such as potentially higher accuracy by first concentrating on promising regions likely to contain objects. During the first stage, 3D object detection identifies potential objects in the scene. Subsequently, for each identified region of interest (ROI), the method predicts a label for each point within that region, effectively segmenting the objects. Hou et al. [16] proposed a method called 3D-SIS for RGB-D-based 3D semantic instance segmentation. This network utilizes a fully convolutional network to extract 2D color features and 3D geometric features, which are subsequently fused for 3D instance segmentation. Yi et al. [17] proposed the Generative Shape Proposal Network (GSPN) based on PointNet [18]. This network uses a generative proposal method to produce candidate point sets similar in shape to the target points and encodes features to construct global and local structural information. Consequently, GSPN can effectively learn the shape features of the target, thereby improving segmentation accuracy. Yang et al. [9] proposed a single-stage, anchor-free, end-to-end trainable network called 3D-BoNet. The network takes a simpler approach, using point-wise MLPs to directly predict 3D bounding boxes for each object and point-wise masks to refine the segmentation. 3D-BoNet also incorporates an optimal assignment method to handle data association effectively, balancing accuracy and speed compared with traditional methods.
In summary, proposal-based methods, exemplified by [19,20,21,22], demonstrate proficiency in handling objects with intricate shapes, achieving precise shapes and poses for each instance. Nonetheless, these methods often adopt multi-stage structures, leading to increased computational overhead and vulnerability to the choice of candidate region generation methods.
Proposal-free methods. These methods typically treat instance segmentation as a post-processing step of semantic segmentation. They operate under the assumption that points belonging to the same instance share similar features, thus emphasizing feature learning. Wang et al. [23] first proposed the deep learning model SGPN, which is founded on similarity measurement and a grouping strategy. SGPN uses a deep similarity matrix to measure the similarity between each pair of points in the point cloud and employs a novel grouping strategy to assign points to their respective instances. Similarly, Liu et al. [24] proposed the instance segmentation network MASC, which is based on SSC and point similarity. This network utilizes a multi-scale point affinity prediction method to measure the similarity between each pair of points in the point cloud and reduces computational costs using sparse convolution. Given that semantic labels can be utilized for instance label prediction, some methods integrate these two tasks into a unified approach. Wang et al. [25] first proposed the ASIS framework, designed to handle instance segmentation and semantic segmentation simultaneously. ASIS employs two collaborative approaches, semantic-aware instance segmentation and instance-fused semantic segmentation, which work in conjunction with each other. Similarly, Zhao et al. proposed JSNet [26] for joint semantic and instance segmentation.
Overall, proposal-free methods, such as [27,28,29,30,31,32], directly extract and classify point clouds, predicting instance labels, thereby offering lower computational costs and faster speeds. However, proposal-free methods encounter difficulties in handling objects with complex shapes and may struggle to achieve precise 3D shapes and poses for each instance.
3. Our Method
3.1. Overview
Our method consists of two primary components: a 3D backbone and a bounding box module. The backbone network takes a 3D point cloud P as input and extracts point cloud features; we adopt the OA mechanism and SSC as our backbone. The bounding box module consists of parallel network branches dedicated to bounding box regression and instance mask prediction. The box regression branch takes the global features $F_g$ from the backbone and outputs bounding boxes B with scores $B_s$, and the instance mask prediction branch takes the global features $F_g$ and local features $F_l$ as inputs and transforms them into point-level binary masks. The bounding box regression branch uses the association index A to associate ground truth bounding boxes with predicted bounding boxes and computes the cost matrix C. Additionally, we use the cost matrix to define the multi-criteria loss function. The overview of our method is illustrated in Figure 1.
3.2. Backbone Network
Compared to the MLP structure, attention mechanisms can capture not only distance information between points but also complex spatial information, enabling the network to better capture global dependencies and enhance focus on important features. However, attention mechanisms lead to increased network complexity and computational demands. Therefore, we introduce sparse submanifold convolution to reduce a large number of redundant computations, improving network training and inference speed. As a result, our backbone network can achieve excellent feature extraction efficiency.
3.2.1. Offset-Attention-Based Global Feature Extraction
Inspired by Guo et al. [11], we embed the input points P into a new feature space, obtaining embedded features $F_e$. The structure of the global feature module is shown in Figure 2. To enable the network to learn richer and more comprehensive feature representations, we stack multiple attention modules in TSPconv-Net. This progressive stacking allows each layer to abstract and extract higher-level features from the output of the previous layer, thereby enhancing the network's learning capacity and robustness. Additionally, the concatenation operation fuses features from different levels, preserving valuable information. The parameters of each layer are learnable, allowing the model to learn the optimal combination of features at different levels and to capture more complex and nuanced feature representations. Moreover, fusing feature representations from different levels enhances the model's robustness to noise and occlusion. However, directly stacking attention modules increases the complexity of the network, which can lead to exploding or vanishing gradients. Therefore, we introduce residual modules to avoid such issues and improve the performance of TSPconv-Net. We then feed $F_e$ into the stacked attention modules and obtain point-wise local features through the LBR layer. The LBR layer, composed of a linear layer, a batch normalization layer, and a ReLU activation function, enhances the expressive power and accelerates the convergence of TSPconv-Net. Finally, these local features are transformed into the global features $F_g$ through a pooling layer. The specific calculation of the attention features is shown in Equation (1).
$$F_1 = \mathrm{AT}^1(F_e), \quad F_i = \mathrm{AT}^i(F_{i-1}),\ i = 2, \ldots, k, \qquad F_o = \mathrm{concat}(F_1, \ldots, F_k) \cdot W_o \qquad (1)$$
where $\mathrm{AT}^i$ represents the i-th attention layer, each of which has the same input and output dimensions, k is the number of stacked attention layers, and $W_o$ represents the weights of the final linear layer. The embedding feature module consists of two cascaded LBRs. Through this combination, the embedding feature module efficiently extracts data features and enhances the fitting ability of our network to nonlinear data. The specific structure is depicted in Figure 3.
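For illustration, the following PyTorch sketch shows one way to implement the LBR block and the two-LBR embedding module described above; the module names, the 3-channel input (xyz only), and the 128-dimensional embedding are assumptions of this sketch rather than values taken from the paper.

```python
import torch
import torch.nn as nn

class LBR(nn.Module):
    """Linear -> BatchNorm -> ReLU block operating on (batch, num_points, channels)."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.linear = nn.Linear(in_channels, out_channels)
        self.bn = nn.BatchNorm1d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.linear(x)                                   # (B, N, C_out)
        x = self.bn(x.transpose(1, 2)).transpose(1, 2)       # BatchNorm1d expects (B, C, N)
        return self.relu(x)

class EmbeddingModule(nn.Module):
    """Two cascaded LBRs that lift raw point attributes into the embedding space F_e."""
    def __init__(self, in_channels: int = 3, embed_dim: int = 128):
        super().__init__()
        self.lbr1 = LBR(in_channels, embed_dim)
        self.lbr2 = LBR(embed_dim, embed_dim)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        return self.lbr2(self.lbr1(points))                  # F_e: (B, N, embed_dim)
```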
After obtaining the embedded features $F_e$, further feature extraction is performed using attention mechanisms. However, the original attention mechanism used in the Transformer [33] only aggregates adjacent information near nodes, yielding the corresponding adjacency matrices, whereas the inherent information of the point cloud data is equally crucial. Inspired by GCNs [34], we replace the original adjacency matrices in the attention mechanism with Laplacian matrices, resulting in an improved attention mechanism known as the offset-attention mechanism. The specific structure of the offset-attention module is shown in Figure 4. Notably, solid lines indicate the self-attention computation process, while dotted lines indicate the offset-attention computation process. Adding the nonlinear transformation (LBR) of the offsets back to the input features forms a residual structure that enhances the ability to learn local features while preserving the original information. This improves the model's ability to capture complex geometric structures and represent features.
The specific processing of OA is as follows. First, the output $F_e$ of the embedding feature module is used as the input feature vector $F_{in}$ of dimension $d_e$, and the query (Q), key (K), and value (V) matrices are obtained by linear transformation through the linear layer. The corresponding calculation is shown in Equation (2).
$$(Q, K, V) = F_{in} \cdot (W_q, W_k, W_v), \qquad Q, K \in \mathbb{R}^{N \times d_a},\ V \in \mathbb{R}^{N \times d_e} \qquad (2)$$
where $W_q$, $W_k$, and $W_v$ represent the weight matrices of shared learnable linear transformation layers, and $d_a$ represents the dimension of Q and K. After obtaining Q, K, and V, the query matrix Q and key matrix K are used to calculate the attention weights according to Equation (3).
$$\tilde{A} = (\tilde{\alpha}_{i,j}) = Q \cdot K^{T} \qquad (3)$$
The attention matrix A is obtained by normalizing the attention weights $\tilde{\alpha}_{i,j}$ from Equation (3) using the SoftMax operator and the $\ell_1$-norm, as shown in Equation (4).
$$\bar{\alpha}_{i,j} = \mathrm{softmax}(\tilde{\alpha}_{i,j}) = \frac{\exp(\tilde{\alpha}_{i,j})}{\sum_{k}\exp(\tilde{\alpha}_{k,j})}, \qquad \alpha_{i,j} = \frac{\bar{\alpha}_{i,j}}{\sum_{k}\bar{\alpha}_{i,k}} \qquad (4)$$
The self-attention output features $F_{sa}$ are obtained using Equation (5). Then, the offset between the input features $F_{in}$ and $F_{sa}$ is calculated. Finally, the offset, transformed by the LBR layer, is added to $F_{in}$ to obtain the output features $F_{out}$; this offset between $F_{in}$ and $F_{sa}$ is analogous to the computation of a Laplacian matrix.
$$F_{sa} = A \cdot V, \qquad F_{out} = \mathrm{LBR}(F_{in} - F_{sa}) + F_{in}, \qquad F_{in} - F_{sa} \approx (I - A)\,F_{in} = L\,F_{in} \qquad (5)$$
where L represents the Laplacian matrix. Offset attention uses element-wise offsets between the input features and the self-attention features to approximate the Laplacian operator, thereby enhancing network performance. Compared to self-attention in the original Transformer, offset attention offers stronger adaptability, higher robustness, and improved model performance.
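The offset-attention computation of Equations (2)-(5) can be sketched in PyTorch as follows; the Q/K dimension of 32 and the placement of batch normalization inside the LBR block are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetAttention(nn.Module):
    """One offset-attention layer (Equations (2)-(5)).

    The attention map is normalized with softmax followed by an l1 normalization;
    the offset F_in - F_sa is passed through an LBR block and added back to F_in.
    """
    def __init__(self, channels: int, d_a: int = 32):
        super().__init__()
        self.w_q = nn.Linear(channels, d_a, bias=False)
        self.w_k = nn.Linear(channels, d_a, bias=False)
        self.w_v = nn.Linear(channels, channels, bias=False)
        self.lbr_linear = nn.Linear(channels, channels)
        self.lbr_bn = nn.BatchNorm1d(channels)

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:      # f_in: (B, N, C)
        q, k, v = self.w_q(f_in), self.w_k(f_in), self.w_v(f_in)            # Eq. (2)
        energy = torch.bmm(q, k.transpose(1, 2))                            # Eq. (3): (B, N, N)
        attn = F.softmax(energy, dim=1)                                     # softmax normalization
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)                # Eq. (4): l1 normalization
        f_sa = torch.bmm(attn, v)                                           # Eq. (5): A . V
        offset = f_in - f_sa                                                # Laplacian-like offset
        offset = self.lbr_bn(self.lbr_linear(offset).transpose(1, 2)).transpose(1, 2)
        return f_in + F.relu(offset)                                        # residual output F_out
```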
To address potential issues such as gradient vanishing or exploding during training, we integrate a residual network structure into the offset-attention module. Specifically, we introduce ResNet [35], which employs a single residual module structure as illustrated in Figure 5.
In this structure, each residual module consists of a series of convolutional layers followed by a shortcut connection that directly adds the input of the module to its output. This design helps to mitigate the gradient vanishing problem by allowing gradients to flow more easily through the network during backpropagation. Additionally, the use of the shortcut connections facilitates the learning of identity mappings, enabling the network to learn residual functions effectively.
The network performs both max- and average-pooling operations on the extracted features independently. Max-pooling captures the most salient features by selecting the maximum value within each pooling region, while average pooling smoothes the feature map by averaging values. These pooled features are then merged to form the final global feature , which encapsulates both the detailed and averaged representations of the input data. This approach enhances the ability of the model to generalize and improve performance across various tasks.
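As a minimal sketch, assuming per-point features of shape (batch, points, channels), the max/average pooling fusion described above can be written as follows; concatenating the two pooled vectors is one reasonable way to merge them.

```python
import torch

def global_feature(point_features: torch.Tensor) -> torch.Tensor:
    """Fuse max- and average-pooled descriptors into one global feature vector F_g.

    point_features: (B, N, C) per-point features from the stacked attention layers.
    returns: (B, 2 * C) global feature.
    """
    f_max = point_features.max(dim=1).values   # most salient activation per channel
    f_avg = point_features.mean(dim=1)         # smoothed activation per channel
    return torch.cat([f_max, f_avg], dim=-1)   # merged global descriptor
```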
3.2.2. Sparse Submanifold Convolution-Based Local Feature Extraction
To achieve superior computational efficiency and reduce memory consumption compared to traditional voxelization methods, we employ sparse submanifold convolution for voxel aggregation. This method only performs convolution on relevant, non-empty voxels, significantly reducing unnecessary computations and memory footprint while effectively extracting local feature information from the point cloud data.
The processing steps of the local feature extraction module mainly involve voxelization of the original point cloud, local feature extraction based on sparse submanifold convolution, and devoxelization.
Voxelization of the original point cloud. To process point cloud data of different scales with convolution, we first voxelize the point cloud and normalize the coordinates of the input point cloud. All points are translated into a local coordinate system centered at the centroid and then normalized to fit within a unit sphere by dividing by the maximum L2 norm. The coordinates are linearly mapped (scaled and shifted) from [−1, 1] to [0, 1], completing coordinate normalization; the normalized coordinates are denoted $\hat{p}_n = (\hat{x}_n, \hat{y}_n, \hat{z}_n)$. After obtaining the normalized coordinates, the coordinates and the voxel resolution r are used to calculate the voxel index for each point. For the points falling into a voxel, the features associated with those coordinates undergo average pooling, and the averaged feature becomes the feature of that voxel. The normalized point cloud is thus transformed into a voxel grid of resolution $r \times r \times r$. The specific calculation is shown in Equation (6).
$$G_{u,v,w,c} = \frac{1}{N_{u,v,w}} \sum_{n=1}^{N} \mathbb{I}\!\left[\lfloor \hat{x}_n r \rfloor = u,\ \lfloor \hat{y}_n r \rfloor = v,\ \lfloor \hat{z}_n r \rfloor = w\right] f_{n,c} \qquad (6)$$
where $G_{u,v,w,c}$ denotes the feature value of the c-th channel of the voxel grid at position (u, v, w), r represents the voxel resolution, $\mathbb{I}[\cdot]$ is a binary indicator of whether the coordinate $\hat{p}_n$ falls into the voxel cell (u, v, w), $f_{n,c}$ represents the c-th channel feature associated with $\hat{p}_n$, and $N_{u,v,w}$ represents the normalization factor (the number of points falling into that voxel).
Submanifold sparse convolution. To implement the convolution operation, we use a hash table and a matrix: each row of the matrix stores the features of an active point, and the hash table contains tuples of all active points along with their row positions in the matrix. Additionally, we define a rule book as a collection $R = (R_1, \ldots, R_{f^d})$ of integer matrices indexed by the filter offsets, where f represents the size of the filter and $f^d$ denotes the spatial size of the d-dimensional convolutional filter. We reuse the input hash table for the output and construct an appropriate rule book to implement submanifold sparse convolution. The cost of building the hash table and the rule book is $O(l)$ and is independent of the depth of the network, where l is the number of active points in the input layer.
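To make the hash-table and rule-book bookkeeping concrete, the following pure-Python sketch builds both structures for a cubic submanifold filter; the dictionary-based layout and function name are illustrative assumptions rather than the data structures of any particular sparse-convolution library.

```python
from itertools import product

def build_hash_and_rulebook(active_coords, filter_size=3):
    """Build the active-site hash table and the rule book for a submanifold convolution.

    active_coords: list of (u, v, w) integer voxel coordinates of non-empty voxels.
    Returns (hash_table, rulebook), where hash_table maps coordinate -> feature-matrix row
    and rulebook maps each filter offset -> list of (input_row, output_row) pairs.
    Submanifold convolution keeps the output sites identical to the input sites,
    so the same hash table is reused for the output.
    """
    hash_table = {coord: row for row, coord in enumerate(active_coords)}   # O(l)
    half = filter_size // 2
    offsets = list(product(range(-half, half + 1), repeat=3))              # f^3 filter offsets
    rulebook = {off: [] for off in offsets}
    for out_coord, out_row in hash_table.items():                          # O(l * f^3) = O(l) for fixed f
        for off in offsets:
            in_coord = (out_coord[0] + off[0], out_coord[1] + off[1], out_coord[2] + off[2])
            in_row = hash_table.get(in_coord)
            if in_row is not None:                                         # only active inputs contribute
                rulebook[off].append((in_row, out_row))
    return hash_table, rulebook

# Example: three active voxels, two of which are neighbors.
ht, rb = build_hash_and_rulebook([(0, 0, 0), (0, 0, 1), (5, 5, 5)])
```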
Devoxelization. To map the voxel features back to their corresponding points in the point cloud, we employ trilinear interpolation [36] for upsampling. This process recovers higher-resolution features by considering the eight nearest neighboring voxels in the local feature volume and interpolating their values along all three dimensions. These steps are repeated for all points requiring interpolation. The voxelized local features are thus mapped back to the original point cloud space, completing devoxelization and yielding the local features $F_l$ of the original point cloud.
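A minimal PyTorch sketch of the voxelize/devoxelize pair is given below (the sparse convolution between the two steps is omitted); the function names and single-cloud CPU layout are assumptions of this sketch. Trilinear interpolation is realized with torch.nn.functional.grid_sample, whose bilinear mode performs trilinear sampling on 5-D inputs.

```python
import torch
import torch.nn.functional as F

def voxelize(norm_coords: torch.Tensor, feats: torch.Tensor, r: int) -> torch.Tensor:
    """Average-pool point features into an r x r x r voxel grid (Equation (6)).

    norm_coords: (N, 3) coordinates already normalized to [0, 1]; feats: (N, C).
    returns: (C, r, r, r) voxel features.
    """
    idx = (norm_coords * r).long().clamp_(0, r - 1)                  # per-axis voxel indices
    flat = (idx[:, 0] * r + idx[:, 1]) * r + idx[:, 2]               # flattened voxel index
    C = feats.shape[1]
    grid_sum = torch.zeros(r * r * r, C).index_add_(0, flat, feats)  # sum features per voxel
    counts = torch.zeros(r * r * r).index_add_(0, flat, torch.ones(flat.shape[0]))
    grid = grid_sum / counts.clamp(min=1).unsqueeze(1)               # average pooling, Eq. (6)
    return grid.t().reshape(C, r, r, r)

def devoxelize(voxel_feats: torch.Tensor, norm_coords: torch.Tensor) -> torch.Tensor:
    """Trilinearly interpolate voxel features back onto the points.

    voxel_feats: (C, r, r, r); norm_coords: (N, 3) in [0, 1]. returns: (N, C).
    Voxel-center alignment is approximate in this sketch (align_corners=True).
    """
    grid = norm_coords * 2.0 - 1.0                                   # grid_sample expects [-1, 1]
    grid = grid[:, [2, 1, 0]].view(1, -1, 1, 1, 3)                   # reorder to (x, y, z) = (W, H, D)
    sampled = F.grid_sample(voxel_feats.unsqueeze(0), grid,
                            mode='bilinear', align_corners=True)     # trilinear for 5-D input
    return sampled.view(voxel_feats.shape[0], -1).t()                # (N, C) per-point local features
```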
3.3. Bounding Box Processing Module
3D-BoNet [9] consists of a backbone network and two parallel network branches, which perform bounding box regression and per-point mask prediction, respectively. This network structure is simple and easy to train and deploy. Compared to anchor-based methods, 3D-BoNet does not require a complex and time-consuming process for generating candidate bounding boxes, resulting in higher efficiency. Compared to anchor-free methods, 3D-BoNet explicitly predicts the bounding boxes of targets, resulting in instances with better precision. Therefore, we adopt the bounding box regression branch and point mask prediction branch from 3D-BoNet as the bounding box processing module in TSPconv-Net.
3.3.1. Bounding Box Regression Branch
The bounding box regression branch uses the global features $F_g$ extracted from the backbone as its input and directly regresses the bounding boxes B and the corresponding bounding box scores $B_s$. By comparing the predicted bounding boxes B with the ground truth bounding boxes $\bar{B}$ through the bounding box association layer, a loss function is constructed, and the overall network is optimized by minimizing this loss function.
Bounding box encoding. Unlike methods that define a bounding box by its center position and three side lengths, this module represents each bounding box by two parameterized vertices, its minimum and maximum corners: $B = \{(x_{\min}, y_{\min}, z_{\min}), (x_{\max}, y_{\max}, z_{\max})\}$.
Neural layer. This consists of two fully connected layers that use the Leaky ReLU function as the nonlinear activation, followed by two parallel fully connected layers forming two branches with different outputs. One branch outputs a $6H$-dimensional vector and reshapes it into an $H \times 2 \times 3$ tensor, where H represents the maximum number of bounding boxes the network can predict and each $2 \times 3$ slice holds the two vertices of a predicted bounding box. The other branch computes the corresponding bounding box scores $B_s$ using the sigmoid function. The bounding box score is positively correlated with the likelihood of an object being contained within the bounding box.
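The neural layer just described can be sketched as follows; the hidden sizes (512/256) and the maximum box count H = 24 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BBoxRegressionHead(nn.Module):
    """Regress H candidate boxes (min/max vertices) and per-box scores from the global feature."""
    def __init__(self, global_dim: int, num_boxes: int = 24):
        super().__init__()
        self.H = num_boxes
        self.shared = nn.Sequential(
            nn.Linear(global_dim, 512), nn.LeakyReLU(),
            nn.Linear(512, 256), nn.LeakyReLU(),
        )
        self.box_layer = nn.Linear(256, num_boxes * 6)      # 6 numbers per box: two xyz vertices
        self.score_layer = nn.Linear(256, num_boxes)

    def forward(self, f_g: torch.Tensor):
        h = self.shared(f_g)                                 # (B, 256)
        boxes = self.box_layer(h).view(-1, self.H, 2, 3)     # (B, H, 2, 3): min and max vertices
        scores = torch.sigmoid(self.score_layer(h))          # (B, H) box confidence scores
        return boxes, scores
```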
Bounding box association layer. This part associates the predicted bounding boxes B with the ground truth bounding boxes $\bar{B}$, where T is the number of ground truth bounding boxes. The overall loss function of the network is then defined based on the cost between B and $\bar{B}$. Since the number of predicted bounding boxes usually differs from the number of ground truth bounding boxes (assuming $H \ge T$), the network must associate a unique predicted bounding box with each ground truth bounding box to achieve an optimal allocation. Let C be the cost matrix between predicted and ground truth bounding boxes, where $C_{i,j}$ represents the cost of matching the i-th predicted box to the j-th ground truth box; a smaller cost indicates higher similarity. The bounding box association layer is therefore tasked with determining the allocation matrix A that minimizes the overall cost. The specific calculation is shown as Equation (7).
$$A = \arg\min_{A}\sum_{i=1}^{H}\sum_{j=1}^{T} C_{i,j}\,A_{i,j}, \quad \text{s.t.}\ \sum_{i=1}^{H} A_{i,j} = 1,\ \ \sum_{j=1}^{T} A_{i,j} \le 1 \qquad (7)$$
where A is a Boolean association matrix and $A_{i,j}$ indicates whether the i-th predicted bounding box is assigned to the j-th ground truth bounding box. Each column of A represents the associations between the j-th ground truth bounding box and all predicted bounding boxes, so the sum of the elements in each column is 1. Moreover, since each predicted bounding box is associated with at most one ground truth bounding box, the sum of the elements in each row does not exceed 1. Due to the sparsity and uneven distribution of 3D point clouds, it is not feasible to use only the Euclidean distance between the predicted and ground truth bounding boxes as the matching criterion. As shown in Figure 6, one predicted bounding box may contain more valid points and overlap more with a ground truth bounding box than another predicted box whose vertices are nominally closer; in that case, the ground truth bounding box should be associated with the former. Therefore, the bounding box regression branch considers not only the Euclidean distance between vertices and the soft intersection over union but also incorporates the cross-entropy score into the cost matrix calculation. To solve this optimal allocation problem, the bounding box regression branch uses the Hungarian algorithm [37,38] to compute the optimal solution.
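As a sketch of the association step, the optimal allocation of Equation (7) can be solved with the Hungarian algorithm as implemented in SciPy; linear_sum_assignment handles the rectangular case (H ≥ T) directly, so no padding is required. Variable names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_boxes(cost: np.ndarray) -> np.ndarray:
    """Solve Equation (7): assign one predicted box to each ground truth box.

    cost: (H, T) matrix, cost[i, j] = C_ij between predicted box i and ground truth box j.
    returns: boolean association matrix A of shape (H, T).
    """
    pred_idx, gt_idx = linear_sum_assignment(cost)   # Hungarian algorithm, minimizes total cost
    A = np.zeros_like(cost, dtype=bool)
    A[pred_idx, gt_idx] = True                       # each column sums to 1, each row to at most 1
    return A

# Example with H = 4 predictions and T = 2 ground truth boxes.
C = np.array([[0.9, 0.2],
              [0.1, 0.8],
              [0.7, 0.7],
              [0.6, 0.3]])
print(associate_boxes(C))
```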
The costs of the three components are calculated as follows:
(1) Euclidean Distance. Defined as the Euclidean distance between the i-th predicted bounding box $B_i$ and the j-th ground truth bounding box $\bar{B}_j$, computed as the average of the squared differences between their vertex coordinates. The detailed computation is presented in Equation (8).
$$C^{ed}_{i,j} = \frac{1}{6}\sum \left(B_i - \bar{B}_j\right)^2 \qquad (8)$$
where the sum runs over the six vertex coordinates of each box.
(2) Soft Intersection over Union (sIoU). Defined as the sIoU between the predicted bounding box and the ground truth bounding box. The specific calculation is shown as Equation (9).
$$C^{sIoU}_{i,j} = \frac{-\sum_{n=1}^{N} q_i^{n}\,\bar{q}_j^{n}}{\sum_{n=1}^{N} q_i^{n} + \sum_{n=1}^{N} \bar{q}_j^{n} - \sum_{n=1}^{N} q_i^{n}\,\bar{q}_j^{n}} \qquad (9)$$
where $q_i^{n}$ represents the probability of the n-th point being inside the i-th predicted bounding box, and $\bar{q}_j^{n}$ represents the probability of the n-th point being inside the j-th ground truth bounding box. These probabilities are calculated using Algorithm 1.

Algorithm 1. Calculate the probability of the points in the input point cloud P being inside the predicted bounding boxes B, where H represents the number of predicted bounding boxes, N represents the number of points in the input point cloud, and $\theta_1$ and $\theta_2$ are hyperparameters set to 100 and 20, respectively, for numerical stability.
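Since the algorithm box is not reproduced here, the following PyTorch sketch illustrates one plausible realization of such a differentiable point-in-box probability in the spirit of Algorithm 1: per-axis distances to the nearest box face are scaled by $\theta_1$, clipped to $[-\theta_2, \theta_2]$, and squashed with a sigmoid; fusing the three axes by taking the minimum is an assumption of this sketch.

```python
import torch

def point_in_box_probability(points: torch.Tensor, boxes: torch.Tensor,
                             theta1: float = 100.0, theta2: float = 20.0) -> torch.Tensor:
    """Soft probability q of each point lying inside each predicted box.

    points: (N, 3); boxes: (H, 2, 3) with row 0 = min vertex, row 1 = max vertex.
    returns: (H, N) probabilities in (0, 1), differentiable w.r.t. the box vertices.
    """
    p = points.unsqueeze(0)                               # (1, N, 3)
    bmin = boxes[:, 0, :].unsqueeze(1)                    # (H, 1, 3)
    bmax = boxes[:, 1, :].unsqueeze(1)                    # (H, 1, 3)
    d = torch.minimum(p - bmin, bmax - p)                 # signed distance to nearest face, positive inside
    d = torch.clamp(d * theta1, min=-theta2, max=theta2)  # scale and clip for numerical stability
    q_axis = torch.sigmoid(d)                             # per-axis "inside" probability
    return q_axis.min(dim=-1).values                      # fuse x, y, z conservatively (assumption)
```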
(3) Cross-Entropy Score. Defined as the cross-entropy score between $q_i$ and $\bar{q}_j$, reflecting the confidence that the predicted bounding box contains as many valid points as possible. The calculation is shown in Equation (10).
$$C^{ces}_{i,j} = -\frac{1}{N}\sum_{n=1}^{N}\left[\bar{q}_j^{n}\log q_i^{n} + \left(1 - \bar{q}_j^{n}\right)\log\left(1 - q_i^{n}\right)\right] \qquad (10)$$
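Putting the three terms together, a sketch of the cost matrix used by the association layer is given below, assuming the point-in-box probabilities q from the previous sketch; tensor shapes and names are illustrative.

```python
import torch

def cost_matrix(pred_boxes, gt_boxes, q_pred, q_gt, eps: float = 1e-8):
    """Assemble the association cost C = C_ed + C_sIoU + C_ces (Equations (8)-(10)).

    pred_boxes: (H, 2, 3); gt_boxes: (T, 2, 3)
    q_pred: (H, N) soft point-in-box probabilities for predicted boxes
    q_gt:   (T, N) point-in-box indicators for ground truth boxes
    returns: (H, T) cost matrix.
    """
    # Eq. (8): mean squared difference over the six vertex coordinates.
    diff = pred_boxes.reshape(-1, 1, 6) - gt_boxes.reshape(1, -1, 6)
    c_ed = diff.pow(2).mean(dim=-1)                                    # (H, T)

    # Eq. (9): negative soft IoU between point-in-box probabilities.
    inter = q_pred @ q_gt.t()                                          # (H, T) sum_n q_i^n * qbar_j^n
    union = q_pred.sum(1, keepdim=True) + q_gt.sum(1) - inter
    c_siou = -inter / (union + eps)

    # Eq. (10): cross-entropy between predicted and ground-truth occupancy.
    q = q_pred.clamp(eps, 1 - eps).unsqueeze(1)                        # (H, 1, N)
    qb = q_gt.unsqueeze(0)                                             # (1, T, N)
    c_ces = -(qb * q.log() + (1 - qb) * (1 - q).log()).mean(dim=-1)    # (H, T)

    return c_ed + c_siou + c_ces
```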
The bounding box prediction multi-criteria loss is computed by the bounding box association layer, which determines the optimal predicted bounding box for each ground truth bounding box by minimizing the cost. After reordering the predicted boxes according to the association matrix A, the loss consists of three components, (1) Euclidean distance, (2) sIoU, and (3) cross-entropy, as formally expressed in Equation (11):
$$\ell_{bbox} = \frac{1}{T}\sum_{t=1}^{T}\left(C^{ed}_{t,t} + C^{sIoU}_{t,t} + C^{ces}_{t,t}\right) \qquad (11)$$
The bounding box prediction score loss evaluates the validity of the predicted bounding boxes. After reordering by the association matrix A, the ground truth values for the first T scores are "1" and "0" for the remaining scores. The loss is defined using cross-entropy as Equation (12):
$$\ell_{bbs} = -\frac{1}{H}\left[\sum_{t=1}^{T}\log B_s^{t} + \sum_{t=T+1}^{H}\log\left(1 - B_s^{t}\right)\right] \qquad (12)$$
where $B_s^{t}$ represents the score of the t-th associated predicted bounding box.

3.3.2. Point Mask Prediction Branch
The point mask prediction branch feeds the global features $F_g$ and local features $F_l$ into its neural layers to refine the predicted bounding boxes B, as shown in Figure 7, thereby enhancing the segmentation performance within the predicted bounding boxes. Moreover, focal loss with default hyperparameters is used to optimize the point mask prediction branch.
Neural layers. The global and local features are compressed into 256-dimensional vectors through fully connected layers and further fused into 128-dimensional features $\tilde{F}$. For the i-th predicted bounding box $B_i$, the predicted vertices are fused with the fused features $\tilde{F}$ to obtain the perception features $\hat{F}_i$. Finally, the point mask $M_i$ is obtained through shared fully connected layers.
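A rough PyTorch sketch of this fusion is shown below; the dimensions follow the description above, while tiling the flattened box vertices and score onto the per-point features is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class PointMaskHead(nn.Module):
    """Fuse global and local features, condition on each predicted box, and emit per-point masks."""
    def __init__(self, global_dim: int, local_dim: int):
        super().__init__()
        self.compress_g = nn.Linear(global_dim, 256)
        self.compress_l = nn.Linear(local_dim, 256)
        self.fuse = nn.Linear(512, 128)                   # 128-d fused point features
        self.mask_mlp = nn.Sequential(                    # shared FC layers producing the mask
            nn.Linear(128 + 6 + 1, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, f_g, f_l, boxes, scores):
        # f_g: (B, Dg); f_l: (B, N, Dl); boxes: (B, H, 2, 3); scores: (B, H)
        B, N, _ = f_l.shape
        g = self.compress_g(f_g).unsqueeze(1).expand(-1, N, -1)                      # broadcast global feature
        fused = torch.relu(self.fuse(torch.cat([g, self.compress_l(f_l)], dim=-1)))  # (B, N, 128)
        H = boxes.shape[1]
        box_info = torch.cat([boxes.flatten(2), scores.unsqueeze(-1)], dim=-1)       # (B, H, 7)
        box_info = box_info.unsqueeze(2).expand(-1, -1, N, -1)                       # tile per point
        per_point = fused.unsqueeze(1).expand(-1, H, -1, -1)                         # (B, H, N, 128)
        logits = self.mask_mlp(torch.cat([per_point, box_info], dim=-1)).squeeze(-1) # (B, H, N)
        return torch.sigmoid(logits)                                                  # point masks
```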
The loss function of the bounding box processing network consists of the loss functions from both the bounding box regression branch and the point mask prediction branch. TSPconv-Net optimizes the network through this loss function to improve network performance and segmentation quality. The loss function of TSPconv-Net is defined as Equation (13):
$$\ell_{all} = \ell_{sem} + \ell_{bbox} + \ell_{bbs} + \ell_{pmask} \qquad (13)$$
where $\ell_{sem}$ denotes the standard softmax cross-entropy loss used for learning the semantic label of each point, $\ell_{bbox}$ and $\ell_{bbs}$ are the bounding box losses defined in Equations (11) and (12), and $\ell_{pmask}$ represents the focal loss [39] with default hyperparameters, used for optimizing the point mask prediction branch.

4. Experiments and Analysis
We primarily utilize the publicly available S3DIS dataset [14] provided by Stanford University. This dataset comprises 6 educational and office areas, totaling 272 rooms, with over 215 million points in its point clouds. These rooms cover 11 different scene types, such as offices, conference rooms, and auditoriums. We use Python and PyTorch 1.10 to construct TSPconv-Net and conduct training on a single GTX 1660 Ti. TSPconv-Net was trained with mini-batches of size 32, ensuring a manageable computational load while maintaining gradient stability. Training was conducted over 250 epochs, allowing the network sufficient time to converge. The initial learning rate was set to 0.01, enabling efficient exploration of the parameter space in the early stages of training, with adjustments made by the learning rate scheduler to fine-tune model performance as training progressed.
The dataset comprises 13 semantic categories, with each point in each scene labeled as 1 of these 13 categories, including ceiling, floor, wall, beam, column, window, door, table, chair, sofa, bookcase, board, and other objects. We select Area1, Area3, Area4, Area5, and Area6 as the training set, while Area2 serves as the test set.
4.1. Visualization Results
To more clearly demonstrate the optimization of object boundary extraction in our method, we select four representative scenes for visualization. As shown in Figure 8, we can clearly observe that, with the aid of bounding box prediction, our method excels in accurately extracting and distinguishing closely connected objects in the scene.
In these scenarios, traditional segmentation methods often suffer from blurred boundaries or overlapping objects, especially when the objects are densely packed or have complex shapes. However, by introducing bounding box prediction, our method can create clearer boundaries between objects, effectively reducing mis-segmentation.
For instance, in the classroom scene, our method precisely distinguishes and extracts the densely arranged chairs and tables. The bounding box helps to better define the spatial boundaries of the objects, avoiding confusion between chairs or between chairs and tables. Similarly, in conference room scenes, bounding box prediction effectively separates closely positioned chairs and tables, ensuring that each object is accurately recognized and segmented. This approach not only enhances segmentation performance in complex scenes but also significantly improves the model’s ability to handle details, making the boundaries of each object more distinct.
Through this optimization, our method has significantly improved the extraction of individual objects, especially in scenarios where objects are densely arranged or have complex shapes. Compared to traditional segmentation methods, the bounding box prediction technique effectively reduces mis-segmentation, making the final segmentation results more visually clear and accurate. This further proves the superiority of our method in multi-object scenes.
4.2. Comparison Results Analysis
We compare TSPconv-Net with SGPN [23], ASIS [25], and 3D-BoNet [9]. The comparison results are presented in Table 1. For quantitative evaluation, we primarily assess mean precision (mPrec), mean recall (mRec), and mean average precision (mAP). The results show that our method outperforms the others in terms of mPrec and mRec. Compared to SGPN, our network improves mPrec and mRec by 30.4% and 21.0%, respectively. Compared to ASIS, our method improves mPrec and mRec by 5.0% and 4.7%, respectively. Compared to 3D-BoNet [9], our method improves mPrec and mRec by 3.0% and 4.6%, respectively.
Our results demonstrate that proposal-based techniques, exemplified by 3D-BoNet, outperform direct feature-based methods in extracting repetitive objects, achieving superior precision. This advantage probably arises from their ability to generate candidate regions that are likely to contain objects, focusing analysis on these key areas. Furthermore, TSPconv-Net achieves even higher precision than 3D-BoNet by leveraging offset attention (OA) and submanifold sparse convolution (SSC) as the backbone for feature extraction, thereby obtaining more effective features. To objectively assess segmentation performance, we employ the mean average precision (mAP), a standard metric for evaluating detection accuracy. As shown in Table 1, our method achieves a mAP of 60.1%, a significant improvement of 16.5% over SGPN, outperforming ASIS by 4.8% and the baseline 3D-BoNet by 2.6%. Finally, we evaluate the time efficiency of our method against existing approaches (Table 1). Because the SSC in our network eliminates a large amount of unnecessary spatial convolution, TSPconv-Net demonstrates exceptional efficiency, processing data roughly 27 times faster than SGPN; it is also faster than ASIS.
In summary, our method demonstrates excellent results in terms of average precision, average recall, mean average precision, and computation time. This indicates that using predicted bounding boxes for single-object extraction can obtain higher precision in individual object extraction.
5. Conclusions
In this paper, we propose an efficient 3D instance segmentation framework called TSPconv-Net. We introduce OA and SSC as the backbone network of TSPconv-Net, greatly enhancing the ability of TSPconv-Net to capture and represent deep features of point clouds. We feed these features into the bounding box processing module based on 3D-BoNet to obtain segmentation results. TSPconv-Net achieves better performance and comparable speed to 3D-BoNet on the S3DIS dataset. This further confirms the effectiveness of using OA and SSC for extracting point cloud features.
However, due to the introduction of the offset-attention mechanism to handle global features of the point cloud, TSPconv-Net undoubtedly increases the computational complexity of the network. Although TSPconv-Net avoids a large amount of meaningless computation through submanifold sparse convolution, its processing speed is still not ideal compared to the original 3D-BoNet. This issue warrants further investigation in future research, aiming to find ways to enhance the network’s efficiency.
Conceptualization, X.N.; methodology, X.N. and Z.L.; software, Z.L.; validation, Y.M. and Y.W.; formal analysis, Z.S.; investigation, Y.W.; resources, H.J. and Z.S.; data curation, Y.M.; writing—original draft preparation, Y.L.; writing—review and editing, Y.L.; visualization, Z.L.; supervision, X.N.; project administration, Y.W.; funding acquisition, X.N. All authors have read and agreed to the published version of the manuscript.
The raw data supporting the conclusions of this article will be made available by the authors on request.
The authors declare no conflicts of interest.
Figure 8. The segmentation results of TSPconv-Net across various scenes. Different colors represent distinct instances, and instances of the same type may be depicted in different colors.
Table 1. Results of TSPconv-Net on the S3DIS dataset.

| Methods | mPrec (%) | mRec (%) | mAP (%) | Total Time (s) |
|---|---|---|---|---|
| SGPN [23] | 38.2 | 31.2 | 43.6 | 8821.3 |
| ASIS [25] | 63.6 | 47.5 | 55.3 | 402.3 |
| 3D-BoNet [9] | 65.6 | 47.6 | 57.5 | 294.1 |
| TSPconv-Net | 68.6 | 52.2 | 60.1 | 326.1 |
References
1. Badue, C.; Guidolini, R.; Carneiro, R.V.; Azevedo, P.; Cardoso, V.B.; Forechi, A.; Jesus, L.; Berriel, R.; Paixao, T.M.; Mutz, F. et al. Self-driving cars: A survey. Expert Syst. Appl.; 2021; 165, 113816. [DOI: https://dx.doi.org/10.1016/j.eswa.2020.113816]
2. Billard, A.; Kragic, D. Trends and challenges in robot manipulation. Science; 2019; 364, eaat8414. [DOI: https://dx.doi.org/10.1126/science.aat8414] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31221831]
3. Shen, D.; Wu, G.; Suk, H.I. Deep learning in medical image analysis. Annu. Rev. Biomed. Eng.; 2017; 19, pp. 221-248. [DOI: https://dx.doi.org/10.1146/annurev-bioeng-071516-044442] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28301734]
4. Han, X.F.; Laga, H.; Bennamoun, M. Image-based 3D object reconstruction: State-of-the-art and trends in the deep learning era. IEEE Trans. Pattern Anal. Mach. Intell.; 2019; 43, pp. 1578-1604. [DOI: https://dx.doi.org/10.1109/TPAMI.2019.2954885] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31751229]
5. Rozenberszki, D.; Litany, O.; Dai, A. Unscene3D: Unsupervised 3D instance segmentation for indoor scenes. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seattle, WA, USA, 17–21 June 2024; pp. 19957-19967.
6. Kolodiazhnyi, M.; Vorontsova, A.; Konushin, A.; Rukhovich, D. Top-down beats bottom-up in 3D instance segmentation. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; Waikoloa, HI, USA, 4–8 January 2024; pp. 3566-3574.
7. Kolodiazhnyi, M.; Vorontsova, A.; Konushin, A.; Rukhovich, D. Oneformer3D: One transformer for unified point cloud segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seattle, WA, USA, 17–21 June 2024; pp. 20943-20953.
8. Schult, J.; Engelmann, F.; Hermans, A.; Litany, O.; Tang, S.; Leibe, B. Mask3D: Mask transformer for 3D semantic instance segmentation. Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA); London, UK, 29 May–2 June 2023; pp. 8216-8223. [DOI: https://dx.doi.org/10.1109/ICRA48891.2023.10160590]
9. Yang, B.; Wang, J.; Clark, R.; Hu, Q.; Wang, S.; Markham, A.; Trigoni, N. Learning object bounding boxes for 3D instance segmentation on point clouds. Adv. Neural Inf. Process. Syst.; 2019; 32, pp. 6737-6746.
10. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst.; 2017; 30, [DOI: https://dx.doi.org/10.48550/arXiv.1706.02413]
11. Guo, M.H.; Cai, J.X.; Liu, Z.N.; Mu, T.J.; Martin, R.R.; Hu, S.M. Pct: Point cloud transformer. Comput. Vis. Media; 2021; 7, pp. 187-199. [DOI: https://dx.doi.org/10.1007/s41095-021-0229-5]
12. Graham, B.; Engelcke, M.; Van Der Maaten, L. 3D semantic segmentation with submanifold sparse convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–23 June 2018; pp. 9224-9232.
13. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS—Improving object detection with one line of code. Proceedings of the IEEE International Conference on Computer Vision; Venice, Italy, 22–29 October 2017; pp. 5561-5569.
14. Armeni, I.; Sener, O.; Zamir, A.R.; Jiang, H.; Brilakis, I.; Fischer, M.; Savarese, S. 3D semantic parsing of large-scale indoor spaces. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA, 27–30 June 2016; pp. 1534-1543.
15. Siddiqui, M.Y.; Ahn, H. Deep learning-based 3D instance and semantic segmentation: A review. J. Artif. Intell.; 2022; 4, 99. [DOI: https://dx.doi.org/10.32604/jai.2022.031235]
16. Hou, J.; Dai, A.; Nießner, M. 3D-sis: 3D semantic instance segmentation of rgb-d scans. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA, 15–20 June 2019; pp. 4421-4430.
17. Yi, L.; Zhao, W.; Wang, H.; Sung, M.; Guibas, L.J. Gspn: Generative shape proposal network for 3D instance segmentation in point cloud. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA, 15–20 June 2019; pp. 3947-3956.
18. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3D classification and segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA, 21–26 July 2017; pp. 652-660.
19. Narita, G.; Seno, T.; Ishikawa, T.; Kaji, Y. Panopticfusion: Online volumetric semantic mapping at the level of stuff and things. Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); Macau, China, 3–8 November 2019; pp. 4205-4212.
20. Zhang, F.; Guan, C.; Fang, J.; Bai, S.; Yang, R.; Torr, P.H.; Prisacariu, V. Instance segmentation of lidar point clouds. Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA); Paris, France, 31 May–31 August 2020; pp. 9448-9455. [DOI: https://dx.doi.org/10.1109/ICRA40945.2020.9196622]
21. Liu, S.H.; Yu, S.Y.; Wu, S.C.; Chen, H.T.; Liu, T.L. Learning gaussian instance segmentation in point clouds. arXiv; 2020; arXiv: 2007.09860
22. Engelmann, F.; Bokeloh, M.; Fathi, A.; Leibe, B.; Nießner, M. 3D-mpa: Multi-proposal aggregation for 3D semantic instance segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seattle, WA, USA, 13–19 June 2020; pp. 9031-9040.
23. Wang, W.; Yu, R.; Huang, Q.; Neumann, U. Sgpn: Similarity group proposal network for 3D point cloud instance segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–23 June 2018; pp. 2569-2578.
24. Liu, C.; Furukawa, Y. Masc: Multi-scale affinity with sparse convolution for 3D instance segmentation. arXiv; 2019; arXiv: 1902.04478
25. Wang, X.; Liu, S.; Shen, X.; Shen, C.; Jia, J. Associatively segmenting instances and semantics in point clouds. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Long Beach, CA, USA, 15–20 June 2019; pp. 4096-4105.
26. Zhao, L.; Tao, W. Jsnet: Joint instance and semantic segmentation of 3D point clouds. Proceedings of the the AAAI Conference on Artificial Intelligence; New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12951-12958. [DOI: https://dx.doi.org/10.1609/aaai.v34i07.6994]
27. Chen, S.; Fang, J.; Zhang, Q.; Liu, W.; Wang, X. Hierarchical aggregation for 3D instance segmentation. Proceedings of the IEEE/CVF International Conference on Computer Vision; Montreal, QC, Canada, 10–17 October 2021; pp. 15467-15476.
28. He, T.; Shen, C.; Van Den Hengel, A. Dyco3D: Robust instance segmentation of 3D point clouds through dynamic convolution. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Nashville, TN, USA, 20–25 June 2021; pp. 354-363.
29. He, T.; Yin, W.; Shen, C.; Van den Hengel, A. Pointinst3D: Segmenting 3D instances by points. Proceedings of the European Conference on Computer Vision; Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 286-302.
30. Vu, T.; Kim, K.; Luu, T.M.; Nguyen, T.; Yoo, C.D. Softgroup for 3D instance segmentation on point clouds. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; New Orleans, LA, USA, 18–24 June 2022; pp. 2708-2717.
31. Wu, Y.; Shi, M.; Du, S.; Lu, H.; Cao, Z.; Zhong, W. 3D instances as 1d kernels. Proceedings of the European Conference on Computer Vision; Tel Aviv, Israel, 23–27 October 2022; pp. 235-252.
32. Zhao, W.; Yan, Y.; Yang, C.; Ye, J.; Yang, X.; Huang, K. Divide and conquer: 3D point cloud instance segmentation with point-wise binarization. Proceedings of the IEEE/CVF International Conference on Computer Vision; Paris, France, 2–3 October 2023; pp. 562-571.
33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst.; 2017; 30, 15. [DOI: https://dx.doi.org/10.48550/arXiv.1706.03762]
34. Shi, X.; Chai, X.; Xie, J.; Sun, T. Mc-gcn: A multi-scale contrastive graph convolutional network for unconstrained face recognition with image sets. IEEE Trans. Image Process.; 2022; 31, pp. 3046-3055. [DOI: https://dx.doi.org/10.1109/TIP.2022.3163851] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/35385383]
35. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Las Vegas, NV, USA, 27–30 June 2016; pp. 770-778.
36. Fadnavis, S. Image interpolation techniques in digital image processing: An overview. Int. J. Eng. Res. Appl.; 2014; 4, pp. 70-73.
37. Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q.; 1955; 2, pp. 83-97. [DOI: https://dx.doi.org/10.1002/nav.3800020109]
38. Kuhn, H.W. Variants of the Hungarian method for assignment problems. Nav. Res. Logist. Q.; 1956; 3, pp. 253-258. [DOI: https://dx.doi.org/10.1002/nav.3800030404]
39. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision; Venice, Italy, 22–29 October 2017; pp. 2980-2988.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
Current deep learning approaches for indoor 3D instance segmentation often rely on multilayer perceptrons (MLPs) for feature extraction. However, MLPs struggle to effectively capture the complex spatial relationships inherent in 3D scene data. To address this issue, we propose a novel and efficient framework for 3D instance segmentation called TSPconv-Net. In contrast to existing methods that primarily depend on MLPs for feature extraction, our framework integrates a more robust feature extraction model comprising the offset-attention (OA) mechanism and submanifold sparse convolution (SSC). The proposed framework is an end-to-end network architecture. TSPconv-Net consists of a backbone network followed by a bounding box module. Specifically, the backbone network utilizes the OA mechanism to extract global features and employs SSC for local feature extraction. The bounding box module then conducts instance segmentation based on the extracted features. Experimental results demonstrate that our approach outperforms existing work on the S3DIS dataset while maintaining computational efficiency. TSPconv-Net achieves 68.6% mPrec, 52.2% mRec, and 60.1% mAP on the test set, surpassing 3D-BoNet by 3.0% mPrec, 4.6% mRec, and 2.6% mAP. Furthermore, it demonstrates high efficiency, completing computations in just 326 s.
1 Institute of Computer Science and Engineering, Xi’an University of Technology, No. 5 South of Jinhua Road, Xi’an 710048, China;
2 Institute of Computer Science and Engineering, Xi’an University of Technology, No. 5 South of Jinhua Road, Xi’an 710048, China;
3 School of Artificial Intelligence and Computer Science, Jiangnan University, 1800 of Lihu Road, Wuxi 214122, China;