ABSTRACT
Computer-aided design (CAD) serves as an essential and irreplaceable tool for engineers and designers, optimising design workflows and driving innovation across diverse industries. Nevertheless, mastering these sophisticated CAD programmes requires substantial training and expertise from practitioners. To address these challenges, this paper introduces a framework for reconstructing CAD models from multiview images. Specifically, we present a novel end-to-end neural network capable of directly reconstructing parametric CAD command sequences from multiview images. The proposed network further addresses the low-rank bottleneck inherent in the traditional multihead attention mechanism. Finally, we present a novel parametric CAD dataset that pairs multiview images with the corresponding CAD sequences while eliminating redundant data. Comparative experiments reveal that the proposed framework effectively reconstructs high-quality parametric CAD models, which are readily editable in collaborative CAD/CAM environments.
Introduction
Human creativity inherently compels us to conceive and materialise ideas, often manifested as 3D shapes, to deal with difficulties in many areas [1, 2]. As a result, the objects that define our everyday surroundings are meticulously crafted using computer-aided design (CAD), spanning diverse industries including automotive, aerospace, manufacturing and architectural design. The accessibility and utilisation of these CAD programmes for common objects could unlock vast potential for editing and repurposing, with significant implications for mechanical and industrial design fields.
Can machines effectively assist CAD practitioners in alleviating the repetitive and labour-intensive steps of the modelling process? With the rapid advancements in deep learning for 3D reconstruction, a growing number of researchers are actively pursuing innovations in this domain. However, current deep learning approaches remain predominantly constrained to discrete representations of 3D shapes, including point clouds [3], polygonal meshes [4–6] and voxels [7].
Therefore, enhancing both the precision and efficiency of 3D modelling, potentially transforming the design of complex structures, is an important task in CAD/CAM applications [8–10]. The challenge mainly lies in two aspects: firstly, efficiently extracting features from 3D shapes, and secondly, using these features to precisely reconstruct CAD command sequences. Concurrently, a key question in computer vision remains: how should 3D shapes best be represented, through native 3D formats such as voxel grids, point clouds or polygon meshes, or via view-based descriptors? Establishing a clearer connection between 3D shape features and CAD command sequences is still an open area of exploration. Existing CAD models are typically represented as point clouds and meshes. However, human vision itself synthesises a 3D impression in the brain from the multiview images seen by the eyes, which makes reconstructing 3D models from multiview images a natural and important problem. Moreover, recent research on multiview CAD has largely focused on classification [11, 12], and at present there is no multiview CAD dataset for reconstruction and generation tasks.
To address these challenges, we propose MultiviewCAD, a novel reconstruction network that generates a sequence of operations employed by CAD software, such as CATIA and Onshape, for crafting 3D shapes. This sequence, which constitutes the core of a CAD model, is integral to the ‘drawing’ phase of 3D shape creation. In industrial design, CAD models serve as a cornerstone, underpinning nearly all contemporary 3D designs. We first conduct a comprehensive comparison of networks with different 3D shape representations, including meshes, point clouds and voxels. Our analysis reveals that the multiview approach provides superior recognition capabilities, with its data being conveniently accessible, even through users' smartphones. Building on this finding, we propose a multiview-based 3D feature extractor designed to effectively capture the features of 3D shapes. To bridge the gap between 3D shape features and CAD command sequences, we draw inspiration from the audio-to-text task and design a transformer-based dual-decoder reconstructor. This reconstructor translates 3D features into CAD command sequences with high accuracy and efficiency. To train our reconstruction network, we also create a new multiview dataset paired with the DeepCAD dataset, encompassing CAD construction sequences and corresponding multiview data. Notably, we identify substantial redundancy in the original 179,133 models of the DeepCAD dataset. To mitigate the risk of model overfitting, we eliminate the redundant data. To promote further research in related fields, we intend to publicly release this dataset.
In summary, the contributions of this paper are as follows:
-
We first present a novel end-to-end network capable of directly reconstructing parametric CAD sequences from multiview images.
-
We propose a novel multiview approach for the CAD sequence dataset and eliminate redundant data from the DeepCAD dataset with the goal of mitigating the risk of network overfitting in future research endeavours.
-
The newly proposed network not only achieves state-of-the-art performance but also sets a robust benchmark for advancing future research in this domain.
Related Works
3D Shape Learning With Computer Vision
3D data has multiple popular representations, leading to a variety of learning approaches. These approaches can be divided into two main categories.
-
Directly based on the native 3D representations of objects, such as polygon meshes [4, 13], voxel-based discretisations [14] and point clouds [15–17].
-
View-based methods [18], which describe the shape of a 3D object through a series of 2D projections.
Native 3D representation methods tend to be high dimensional, making classifiers prone to overfitting due to the so-called curse of dimensionality. In contrast, view-based descriptors are relatively low dimensional and more efficient in evaluation [19].
Multiview learning [18, 20, 21] relies on a 2D image classification network to extract features from multiple perspectives, incorporating a view pooling layer into a VGG-m style network. This layer aggregates features through full-stride channelwise max pooling to obtain a unified feature vector, which is then processed by fully connected layers for category label prediction. However, multiview learning has not yet been applied to CAD reconstruction. Therefore, we construct a multiview encoder based on the MVCNN [21] architecture to extract features of 3D shapes for use in the downstream pipeline.
3D Construction Learning Based on Transformer
Transformer [22] is one of the most popular learning networks in recent years for representations of structured and sequential data, such as text [23] and other types of signals [24–26]. A key component of the transformer networks is the self-attention layer, which enhances token representation by leveraging statistical correlations among the token sequences.
For instance, SkexGen [27] leverages a transformer to encode topological, geometric and extrusion variations of construction sequences into disentangled codebooks. Subsequently, autoregressive decoders generate CAD construction sequences that share properties as specified by the vectors in the codebooks. DeepCAD [28] is the work most closely related to ours. Specifically, DeepCAD uses an autoencoder to encode a CAD model M into a latent vector z and trains a PointNet++ [15] encoder to encode the point cloud representation of M into the same vector z. However, multiview-based methods provide better generalisability and surpass methods based on voxels or point clouds [19], leading us to believe that the feature extraction capability of our multiview encoder is also better than that of PointNet++.
CAD Programme Reconstruction
The task of 3D CAD reconstruction is focused on recovering a CAD programme, which is represented as a sequence of modelling operations, from various types of input, such as meshes, point clouds or multiview images.
As an emerging research area for 3D learning [7, 23, 28, 29], CAD reconstruction is highly challenging due to the unique nature of CAD modelling sequences [30, 31]. Although somewhat comparable to natural language, they exhibit significant differences, particularly in the necessity for predicting both continuous and discrete parameters—an intricate challenge for neural networks. This dual requirement of managing diverse parameter types highlights the unique complexity of CAD reconstruction.
Willis et al. [32] propose a dataset containing 8625 CAD modelling sequences with sketch and extrude operations and present the sequential construction of a CAD programme as a Markov decision process, making it suitable for reinforcement learning methods. Mikaela et al. [3] propose a supervised network, Point2Cyl, for converting raw 3D point clouds into a set of extrusion cylinders and further demonstrate its application in reverse engineering and editing.
Lambourne et al. [7] employ editable, constrained, prismatic CAD models to approximate smooth signed distance functions, reconstructing the input geometry in voxel space. These reconstructions can then be recombined in a differentiable manner, allowing for the definition of a geometric loss function. Although their method is effective, it is limited by a fixed-size 2D sketch database.
However, no existing CAD reconstruction method takes multiview images as input and outputs command sequences, owing to the lack of a multiview CAD dataset. Therefore, we propose the first multiview CAD reconstruction method for parametric CAD models.
The Proposed Method
The overall framework of the proposed MultiviewCAD network is illustrated in Figure 1.
[IMAGE OMITTED. SEE PDF]
Specification of CAD Commands
Large-scale CAD industrial software supports a vast command set, but in everyday practice, only a small portion of it is utilised. Therefore, we focus solely on the subset of frequently used commands. These commands mainly fall into two categories, namely, sketch and extrusion. Despite the limited number of command types, they prove sufficient for representing a wide array of shapes [32].
Typical sketch commands encompass line, arc and circle. Utilising these commands, users can generate a closed-loop sketch. Each command is equipped with parameters to regulate its shape. As for typical extrusion commands, they involve specifications such as height and direction. By employing extrusion commands, users can elevate a 2D sketch to obtain a 3D shape.
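To make this command subset concrete, the following sketch shows one possible way to represent the sketch-and-extrude commands as simple data structures; the class and field names are illustrative assumptions rather than the paper's exact parameterisation.

```python
from dataclasses import dataclass

# Hypothetical container types for the sketch-and-extrude command subset
# described above; field names are illustrative, not the paper's exact schema.

@dataclass
class Line:
    x: float  # end point of the segment (the start is the previous point)
    y: float

@dataclass
class Arc:
    x: float            # end point
    y: float
    sweep_angle: float
    clockwise: bool

@dataclass
class Circle:
    x: float            # centre
    y: float
    radius: float

@dataclass
class Extrude:
    height: float       # extrusion distance
    dx: float           # extrusion direction (unit vector components)
    dy: float
    dz: float

# A CAD model is then a flat sequence of such commands, e.g. a closed
# rectangular sketch followed by a single extrusion:
example_sequence = [
    Line(1, 0), Line(1, 1), Line(0, 1), Line(0, 0),
    Extrude(height=0.5, dx=0.0, dy=0.0, dz=1.0),
]
```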
Multiview Encoder
We develop a multiview encoder based on image-based CNNs for processing 3D shapes. This encoder analyses N views of a given 3D shape, extracting features using a 2D image classification network, specifically VGG-11 [33]. This VGG-11 network is pretrained on ImageNet and fine-tuned on shuffled multiview 2D images of all training 3D shapes to improve network performance. After that, these features are combined in a view-pooling layer using an elementwise maximum operation, known for its simplicity and effectiveness. Finally, the combined features are processed through the classifier component of VGG-11, resulting in a comprehensive shape descriptor for each 3D shape.
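The following PyTorch sketch illustrates the MVCNN-style encoding described above (shared per-view VGG-11 features, elementwise max view pooling, and a classifier head that produces a shape descriptor); the descriptor size, projection layer and module layout are assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg11  # requires a recent torchvision for string weights

class MultiviewEncoder(nn.Module):
    """MVCNN-style encoder sketch: shared VGG-11 features per view,
    elementwise max view pooling, the VGG classifier head, then a
    projection to an assumed descriptor size for the downstream decoder."""

    def __init__(self, descriptor_dim: int = 256):
        super().__init__()
        backbone = vgg11(weights="IMAGENET1K_V1")  # pretrained on ImageNet
        self.features = backbone.features          # convolutional trunk, shared across views
        self.avgpool = backbone.avgpool
        # Reuse all but the last classifier layer (outputs 4096-d features).
        self.classifier = nn.Sequential(*list(backbone.classifier.children())[:-1])
        self.proj = nn.Linear(4096, descriptor_dim)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (B, N, 3, H, W)  ->  shape descriptor: (B, descriptor_dim)
        b, n, c, h, w = views.shape
        x = self.features(views.reshape(b * n, c, h, w))
        x = self.avgpool(x).flatten(1).reshape(b, n, -1)
        x = x.max(dim=1).values                     # view pooling: elementwise maximum
        return self.proj(self.classifier(x))
```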
Dual Decoder CAD Sequence Reconstructor
We now introduce our dual decoder reconstructor, which can decode the extracted 3D shape features into a sequence of CAD commands. This reconstructor is based on the transformer, inspired by its success in processing sequential data.
Fixed Multihead Attention
The superior performance of the transformer in processing sequential data is closely linked to its multihead self-attention layers. Multihead attention generally outperforms single-head attention in various scenarios, suggesting that increasing the number of heads endows the model with more expressive power; at the same time, however, it reduces the size of each head, which can diminish that expressive power. This is because, in a multihead attention layer, each head independently computes self-attention among the tokens of the input sequence, and the results are then concatenated. With embedding size $d$ and $h$ heads, when the number of heads $h$ is greater than $d/n$ (where $n$ is the length of the token sequence), the attention unit inside each head projects onto a dimension $d/h$ smaller than $n$, thereby creating a low-rank bottleneck and losing the ability to represent arbitrary context vectors [34]. We therefore remove the interdependence between the size of the projection within each head and the embedding size of the model: the projection matrices now map onto subspaces of a fixed dimension $d_p$, which is not contingent on the number of heads $h$. The formula for the new attention mechanism is as follows:
$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(XW_i^Q)(XW_i^K)^{\top}}{\sqrt{d_p}}\right) XW_i^V, \qquad \mathrm{FixedMultiHead}(X) = \mathrm{Concat}\!\left(\mathrm{head}_1,\ldots,\mathrm{head}_h\right)W^O,$$
where $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_p}$ and $W^O \in \mathbb{R}^{h d_p \times d}$.
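A minimal PyTorch sketch of such fixed multihead attention is given below, assuming a standard scaled dot-product formulation with the fixed per-head dimension decoupled from the embedding size; module and argument names are illustrative.

```python
import math
from typing import Optional

import torch
import torch.nn as nn

class FixedMultiheadAttention(nn.Module):
    """Sketch of fixed multihead self-attention in the spirit of [34]:
    every head projects queries, keys and values onto a fixed dimension
    head_dim (d_p), decoupled from d_model / num_heads, so the per-head
    rank no longer shrinks as heads are added."""

    def __init__(self, d_model: int, num_heads: int, head_dim: int):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        self.q = nn.Linear(d_model, num_heads * head_dim)
        self.k = nn.Linear(d_model, num_heads * head_dim)
        self.v = nn.Linear(d_model, num_heads * head_dim)
        self.out = nn.Linear(num_heads * head_dim, d_model)

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        b, n, _ = x.shape                                # x: (B, n, d_model)

        def split(t: torch.Tensor) -> torch.Tensor:      # (B, n, h*dp) -> (B, h, n, dp)
            return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.q(x)), split(self.k(x)), split(self.v(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        if mask is not None:                             # e.g. a causal target mask
            scores = scores.masked_fill(mask, float("-inf"))
        y = (scores.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(y)                               # concatenated heads -> d_model
```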
Preliminary Decoder
To reconstruct a valid CAD command sequence, we apply a constant positional encoding. We first broadcast the feature vector $f$ passed from the multiview encoder to a sequence $F \in \mathbb{R}^{N_c \times d}$, where $N_c$ is the length of the command sequence and $d$ is the embedding size, and then add the constant positional encoding $E_{pos}$ to $F$ as shown below:
$$F' = F + E_{pos}, \qquad E_{pos} \in \mathbb{R}^{N_c \times d}.$$
Our preliminary decoder comprises four layers following the transformer decoder architecture and relies on fixed multihead attention; it outputs the preliminary CAD command-sequence feature $F_{\mathrm{pre}}$.
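A possible realisation of this stage is sketched below, assuming a sinusoidal constant positional encoding and standard transformer layers as stand-ins for the fixed-attention layers; the sequence length (60) and feed-forward size (1024) follow the settings reported later, while the embedding size of 256 is an assumption.

```python
import math
import torch
import torch.nn as nn

class PreliminaryDecoder(nn.Module):
    """Sketch of the preliminary decoding stage: broadcast the shape
    descriptor along the command-sequence axis, add a constant (sinusoidal)
    positional encoding, and run four transformer layers."""

    def __init__(self, d_model: int = 256, seq_len: int = 60,
                 num_layers: int = 4, num_heads: int = 8, dim_ff: int = 1024):
        super().__init__()
        self.seq_len = seq_len
        # Constant (non-learned) sinusoidal positional encoding.
        pos = torch.arange(seq_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(seq_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        layer = nn.TransformerEncoderLayer(d_model, num_heads, dim_ff, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers)

    def forward(self, shape_descriptor: torch.Tensor) -> torch.Tensor:
        # shape_descriptor: (B, d_model)  ->  preliminary feature: (B, seq_len, d_model)
        f = shape_descriptor.unsqueeze(1).expand(-1, self.seq_len, -1)  # broadcast
        return self.layers(f + self.pe)
```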
Autoregressive Decoder
Our autoregressive decoder is constructed with four masked transformer decoder layers, sharing the same hyperparameter settings as the preliminary decoder. It accepts the output of the preliminary decoder, together with a target mask, as its input. The autoregressive decoder predicts the CAD command sequence in an autoregressive manner as follows:
$$p\left(c_1, \ldots, c_{N_c}\right) = \prod_{t=1}^{N_c} p\left(c_t \mid c_{<t},\, F_{\mathrm{pre}}\right),$$
where $c_t$ denotes the $t$-th predicted CAD command and $F_{\mathrm{pre}}$ is the preliminary command-sequence feature.
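The following sketch shows one way such masked, step-by-step decoding could be driven at inference time with greedy selection; the decoder, embedding and output-head objects are placeholders (assumed batch-first), not the paper's exact modules.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def greedy_decode(decoder: nn.TransformerDecoder, embed: nn.Embedding,
                  head: nn.Linear, prelim_feat: torch.Tensor,
                  sos_id: int, seq_len: int = 60) -> torch.Tensor:
    """Greedy autoregressive sketch: previously predicted command tokens
    attend, under a causal target mask, to the preliminary sequence feature
    (used here as the decoder memory)."""
    b = prelim_feat.size(0)
    tokens = torch.full((b, 1), sos_id, dtype=torch.long, device=prelim_feat.device)
    for _ in range(seq_len):
        tgt = embed(tokens)                                       # (B, t, d)
        t = tgt.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool,
                                       device=tgt.device), diagonal=1)
        out = decoder(tgt, memory=prelim_feat, tgt_mask=causal)   # masked decoding
        next_tok = head(out[:, -1]).argmax(dim=-1, keepdim=True)  # most likely command
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, 1:]                                          # drop the start token
```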
MultiviewCAD Dataset
Currently, several large-scale 3D shape datasets are available, including the ABC dataset [35], ModelNet [36] and ShapeNet [37], which primarily consist of point clouds, meshes and boundary representations (B-reps). However, these models lack CAD construction information, being solely 3D geometry models.
A recent work, the DeepCAD dataset, tries to introduce construction information, offering 179,133 CAD models with construction sequences including contour sketches and extrusions. However, DeepCAD has its limitations:
Firstly, the DeepCAD dataset contains a significant amount of redundant data. Therefore, there is a high probability that the same 3D model could appear in both the training dataset and the test dataset. Such overlap in the dataset may increase the risk of overfitting during the network learning process.
Secondly, DeepCAD primarily focuses on CAD command sequences and does not contain multiview data, limiting its suitability for view-based CAD reconstruction.
Therefore, in this paper, we propose a new dataset, the MultiviewCAD dataset, which has the following advantages:
On the one hand, our dataset incorporates a hash-based deduplication algorithm, detailed in Figure 2. We transform DeepCAD's JSON data into vector command data, eliminating duplicates via this algorithm. After deduplication checks, we find that among the 179,133 CAD models in DeepCAD, over 50,000 are duplicates. Our MultiviewCAD dataset cleans up these redundant models. Thus, our dataset can decrease the risk of overfitting during network learning.
[IMAGE OMITTED. SEE PDF]
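A minimal sketch of hash-based deduplication over vectorised command sequences is shown below; the hashing scheme (SHA-256 over rounded, JSON-serialised command vectors) is an assumption chosen for illustration, not necessarily the exact algorithm of Figure 2.

```python
import hashlib
import json

def sequence_hash(commands: list[list[float]]) -> str:
    """Hash a vectorised command sequence (one fixed-length parameter vector
    per command). Parameters are rounded first so that numerically identical
    models collapse to the same key; the rounding precision is an assumption."""
    canonical = [[round(p, 6) for p in cmd] for cmd in commands]
    return hashlib.sha256(json.dumps(canonical).encode("utf-8")).hexdigest()

def deduplicate(models: dict[str, list[list[float]]]) -> dict[str, list[list[float]]]:
    """Keep only the first model encountered for every distinct hash value."""
    seen: set[str] = set()
    unique: dict[str, list[list[float]]] = {}
    for model_id, commands in models.items():
        key = sequence_hash(commands)
        if key not in seen:
            seen.add(key)
            unique[model_id] = commands
    return unique
```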
On the other hand, we utilise the Phong reflection model to create rendered views of these sequences, involving a perspective projection to render mesh polygons. The colour of each pixel is calculated by interpolating the reflected intensity at the vertices of the polygons. To maintain consistency, shapes are uniformly scaled to fit within a predefined viewing volume. Establishing multiple viewpoints, similar to virtual cameras, is crucial for rendering each mesh and creating a comprehensive multiview representation.
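For illustration, the snippet below computes evenly spaced virtual-camera positions around a unit-scaled shape at a fixed elevation, in the spirit of the rendering setup described above; the elevation angle and camera radius are assumed values rather than the dataset's actual rendering parameters.

```python
import numpy as np

def camera_positions(num_views: int = 12, elevation_deg: float = 30.0,
                     radius: float = 2.0) -> np.ndarray:
    """Evenly spaced virtual-camera positions around a unit-scaled shape at a
    fixed elevation, returned as a (num_views, 3) array of xyz coordinates."""
    elev = np.deg2rad(elevation_deg)
    azim = np.linspace(0.0, 2.0 * np.pi, num_views, endpoint=False)
    return np.stack([radius * np.cos(elev) * np.cos(azim),
                     radius * np.cos(elev) * np.sin(azim),
                     np.full(num_views, radius * np.sin(elev))], axis=1)
```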
Experiments
In this section, we assess our work from two aspects: the reconstruction of CAD sequences based on multiview and the demonstration of the advantage of our MultiviewCAD dataset.
Experiment Setup
Training Loss
The overall training loss function consists of the command loss $\mathcal{L}_{cmd}$ and the parameter loss $\mathcal{L}_{param}$. To be specific, we set the loss function for quantifying the difference between the predicted CAD model $\hat{M}$ and the ground truth model $M$ as follows:
$$\mathcal{L} = \mathcal{L}_{cmd} + \lambda\,\mathcal{L}_{param} = \sum_{i=1}^{N_c} \ell\left(\hat{t}_i, t_i\right) + \lambda \sum_{i=1}^{N_c} \sum_{j=1}^{N_p} \ell\left(\hat{p}_{i,j}, p_{i,j}\right),$$
where $\ell$ denotes the cross-entropy loss, $t_i$ and $p_{i,j}$ are the ground-truth type and the $j$-th parameter of the $i$-th command, hatted symbols are the corresponding predictions, and $\lambda$ balances the two terms.
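A hedged sketch of such a combined loss, written as cross-entropy over command types plus weighted cross-entropy over quantised parameters, is given below; the weight and the padding convention are assumptions, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def cad_loss(cmd_logits: torch.Tensor, param_logits: torch.Tensor,
             cmd_gt: torch.Tensor, param_gt: torch.Tensor,
             lam: float = 2.0, pad_id: int = -1) -> torch.Tensor:
    """Combined command / parameter loss sketch.

    cmd_logits:   (B, Nc, num_cmd_types)
    param_logits: (B, Nc, Np, num_param_bins)
    cmd_gt:       (B, Nc)       integer command types
    param_gt:     (B, Nc, Np)   quantised parameters, pad_id where unused
    """
    l_cmd = F.cross_entropy(cmd_logits.flatten(0, 1), cmd_gt.flatten())
    l_param = F.cross_entropy(param_logits.flatten(0, 2),
                              param_gt.flatten(),
                              ignore_index=pad_id)   # skip unused parameter slots
    return l_cmd + lam * l_param
```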
Training Progress
All our experiments are performed on a PC outfitted with an Intel i7 CPU and an NVIDIA GeForce RTX 3090 GPU. The number of views, N, is set to 12, and the feedforward dimension is fixed at 1024. The maximum command-sequence length $N_c$ and the number of parameters per command $N_p$ are set to 60 and 16, respectively. The number of layers for both the preliminary decoder and the autoregressive decoder is set to 4. The network is further trained for 300 epochs with a learning rate of 0.0001 and a batch size of 32.
Metrics
To evaluate the performance of our proposed network more effectively, we assess the differences between the reconstructed model $\hat{M}$ and the ground-truth model $M$ from two key perspectives. Firstly, we examine the sequence dimension by comparing the ground-truth sequences with the predicted sequences. Secondly, we shift our focus to directly comparing the 3D shapes.
To accurately evaluate command accuracy, we use two metrics: $ACC_{cmd}$ and $ACC_{param}$. $ACC_{cmd}$ assesses the correctness of the predicted CAD command types, whereas $ACC_{param}$ assesses the correctness of the command parameters.

$ACC_{cmd}$ is defined as follows:
$$ACC_{cmd} = \frac{1}{N_c} \sum_{i=1}^{N_c} \mathbb{I}\left[t_i = \hat{t}_i\right].$$

$ACC_{param}$ is defined as follows:
$$ACC_{param} = \frac{1}{K} \sum_{i=1}^{N_c} \sum_{j=1}^{N_p} \mathbb{I}\left[\left|p_{i,j} - \hat{p}_{i,j}\right| < \eta\right] \mathbb{I}\left[t_i = \hat{t}_i\right],$$
where $K$ is the total number of parameters in the correctly predicted commands and $\eta$ is a tolerance threshold on the quantised parameter values.
We use the Chamfer distance (CD), commonly applied in the generative models of discretised shapes such as point clouds and meshes, to evaluate the quality of the 3D shapes reconstructed by our model. The equation of the CD is as follows:
$$CD\left(S_1, S_2\right) = \frac{1}{\left|S_1\right|} \sum_{x \in S_1} \min_{y \in S_2} \left\|x - y\right\|_2^2 + \frac{1}{\left|S_2\right|} \sum_{y \in S_2} \min_{x \in S_1} \left\|x - y\right\|_2^2,$$
where $S_1$ and $S_2$ are point sets sampled from the reconstructed shape and the ground-truth shape, respectively.
Additionally, given that our model outputs CAD command sequences, it is not always certain that the reconstructed sequences will yield valid 3D shapes. Therefore, we have also established an invalid ratio to measure these problematic CAD sequences.
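The snippet below sketches how the command accuracy and the Chamfer distance defined above could be computed for a single model; it is a brute-force illustration rather than the evaluation code used in the experiments.

```python
import torch

def acc_cmd(pred_types: torch.Tensor, gt_types: torch.Tensor) -> float:
    """Fraction of commands whose type is predicted correctly (both (Nc,))."""
    return (pred_types == gt_types).float().mean().item()

def chamfer_distance(s1: torch.Tensor, s2: torch.Tensor) -> float:
    """Symmetric Chamfer distance between point sets (N1, 3) and (N2, 3),
    matching the equation above; fine for a few thousand points."""
    d = torch.cdist(s1, s2)                       # pairwise Euclidean distances (N1, N2)
    return (d.min(dim=1).values.pow(2).mean() +   # S1 -> S2 term
            d.min(dim=0).values.pow(2).mean()).item()
```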
Comparison Experiments for MultiviewCAD Network
Among recent studies, the DeepCAD network is the most closely related to ours and is the only one with publicly available source code. Consequently, we conduct comparative experiments between MultiviewCAD and DeepCAD, complemented by ablation experiments showcasing the performance of MultiviewCAD. The methods under evaluation are as follows:
DC-P denotes the approach of Wu et al. [28], which uses two independent networks to reconstruct the CAD modelling sequence from point clouds of 3D objects. Specifically, an autoencoder is first pretrained to capture the target data distribution; the point cloud is then mapped by a PointNet++ encoder [15] to the latent vector, and finally the decoder of the autoencoder decodes the latent vector into a CAD modelling sequence.
Ours (norm) uses a standard transformer and does not make any improvements to its multihead attention mechanism.
Ours employs the MultiviewCAD network proposed in this paper, which consists of an efficient 3D feature extractor called a multiview encoder and a dual decoder with improved multihead attention.
Comparison
To ensure a fair comparison, all codes are executed in the same experimental environment. Quantitative experimental results are reported in Table 1. Under the MultiviewCAD dataset, compared to DeepCAD, MultiviewCAD can reconstruct more precise commands and parameters. Moreover, ablation studies also demonstrate that fixed multihead attention can better decode 3D discrete features than normal multihead attention.
TABLE 1 Quantitative evaluation of CAD reconstruction.
| Network | ACC_cmd ↑ | ACC_param ↑ | Median CD ↓ | Invalid ratio ↓ |
| DC-P | 73.21 | 68.43 | 14.29 | 19.74 |
| Ours (norm) | 78.34 | 72.31 | 3.04 | 24.76 |
| Ours | 80.21 | 76.23 | 2.93 | 23.03 |
Comparison Experiments for MultiviewCAD Dataset
In this subsection, we further demonstrate the advantage of the MultiviewCAD dataset. We conduct comparative experiments between the DeepCAD dataset and the MultiviewCAD dataset. The experimental settings are as follows:
DC represents the transformer-based autoencoder for CAD models from Ref. [28], evaluated under the MultiviewCAD dataset; this setting mainly shows the autoencoding performance of the DeepCAD model.
DC (origin) uses the original DeepCAD dataset, with other conditions being the same as DC.
DC-P has been introduced in the previous subsection.
DC-P (origin) uses the original DeepCAD dataset, with other conditions being the same as DC-P.
Discussion of Results
The quantitative experimental results are shown in Table 2. Once the duplicate data are removed, the performance of both the transformer-based autoencoder and the network for reconstructing CAD modelling sequences from point clouds declines to varying degrees under the MultiviewCAD dataset. We believe this is because the duplicate data in the original DeepCAD dataset caused the models to overfit, inflating their apparent performance.
TABLE 2 Quantitative results.
| Experimental setting | ACC_cmd ↑ | ACC_param ↑ | Median CD ↓ | Invalid ratio ↓ |
| DC (origin) | 98.73 | 98.16 | 0.493 | 2.72 |
| DC | 98.62 | 96.16 | 0.823 | 4.92 |
| DC-P (origin) | 85.95 | 74.22 | 10.30 | 12.08 |
| DC-P | 73.21 | 68.43 | 14.29 | 19.74 |
Visualisation and Editing of Reconstructed CAD Models
The qualitative results of the CAD reconstruction using MultiviewCAD are illustrated in the accompanying Figure 3. This visualisation of reconstructed CAD models demonstrates that the proposed MultiviewCAD network and dataset can effectively produce typical CAD models. Each model consists of a sequence of CAD operations, each with specific parameters.
[IMAGE OMITTED. SEE PDF]
Notably, the reconstructed 3D shapes are parametric, supporting further modifications in collaborative settings, as depicted in Figure 4. Hence, MultiviewCAD stands out as a valuable platform for intelligent CAD/CAM applications.
[IMAGE OMITTED. SEE PDF]
Conclusion and Future Works
In this paper, we present MultiviewCAD, a novel end-to-end network tailored to reconstruct CAD modelling sequences from multiview images. Specifically, it integrates a multiview encoder with a dual decoder based on fixed multihead attention to reconstruct CAD modelling sequences autoregressively. To our knowledge, this work is the first multiview reconstruction network for parametric CAD models. Furthermore, we propose a new parametric CAD dataset, the MultiviewCAD dataset, in which we eliminate duplicate data in the DeepCAD dataset and add multiview images paired with the CAD command sequences.
In order to promote research in this field, we will release both the MultiviewCAD network source code and the MultiviewCAD dataset. In the future, we will focus on tasks that include advanced features and introduce more advanced AI techniques [30, 31, 38, 39] to CAD and collaborative design areas.
Author Contributions
Rubin Fan: conceptualization, methodology, software, visualization, writing – original draft, writing – review and editing. Yi Zhang: software, writing – original draft. Fazhi He: conceptualization, funding acquisition, supervision, writing – review and editing.
Acknowledgements
The numerical calculations in this paper have been done on the supercomputing system in the Supercomputing Center of Wuhan University.
Conflicts of Interest
The authors declare no conflicts of interest.
Data Availability Statement
Data supporting the experiments in this study are available from the corresponding author upon reasonable request.
I. Bakkouri and K. Afdel, “Computer‐Aided Diagnosis (cad) System Based on Multi‐Layer Feature Fusion Network for Skin Lesion Recognition in Dermoscopy Images,” Multimedia Tools and Applications 79, no. 29 (2020): 20483–20518, https://doi.org/10.1007/s11042‐019‐07988‐1.
I. Bakkouri and K. Afdel, “Multi‐Scale cnn Based on Region Proposals for Efficient Breast Abnormality Recognition,” Multimedia Tools and Applications 78, no. 10 (May 2019): 12939–12960, https://doi.org/10.1007/s11042‐018‐6267‐z.
M. A. Uy, Y.‐Y. Chang, M. Sung, et al., “Point2cyl: Reverse Engineering 3d Objects From Point Clouds to Extrusion Cylinders,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), 11850–11860.
C. Lin, T. Fan, W. Wang, and M. Nießner, “Modeling 3d Shapes by Reinforcement Learning,” in European Conference on Computer Vision, (2020), 545–561.
L. Li, F. He, R. Fan, B. Fan, and Y. Xiaohu, “3d Reconstruction Based on Hierarchical Reinforcement Learning With Transferability,” Integrated Computer‐Aided Engineering 30, no. 4 (2023): 327–339, https://doi.org/10.3233/ica‐230710.
L. Fan, C. Wu, F. He, B. Fan, and Y. Liang, “Delaunay Meshes Simplification With Multi‐Objective Optimisation and Fine Tuning,” IET Collaborative Intelligent Manufacturing 5, no. 4 (2023): e12088, https://doi.org/10.1049/cim2.12088.
J. G. Lambourne, K. Willis, P. K. Jayaraman, L. Zhang, A. Sanghi, and K. R. Malekshan, “Reconstructing Editable Prismatic Cad From Rounded Voxel Models,” in SIGGRAPH Asia 2022 Conference Papers, (2022), 1–9.
I. Bakkouri and K. Afdel, “Mlca2f: Multi‐Level Context Attentional Feature Fusion for Covid‐19 Lesion Segmentation From Ct Scans,” Signal, Image and Video Processing 17, no. 4 (2023): 1181–1188, https://doi.org/10.1007/s11760‐022‐02325‐w.
I. Bakkouri, K. Afdel, J. Benois‐Pineau, and Gwénaëlle Catheline For the Alzheimer’s Disease Neuroimaging Initiative, “Bg‐3dm2f: Bidirectional Gated 3d Multi‐Scale Feature Fusion for Alzheimer’s Disease Diagnosis,” Multimedia Tools and Applications 81, no. 8 (2022): 10743–10776, https://doi.org/10.1007/s11042‐022‐12242‐2.
I. Bakkouri and S. Bakkouri, “2mgas‐net: Multi‐Level Multi‐Scale Gated Attentional Squeezed Network for Polyp Segmentation,” Signal, Image and Video Processing 18, no. 6 (2024): 5377–5386, https://doi.org/10.1007/s11760‐024‐03240‐y.
S. Li and J. Corney, “Multi‐View Expressive Graph Neural Networks for 3d Cad Model Classification,” Computers in Industry 151 (2023): 103993, https://doi.org/10.1016/j.compind.2023.103993.
V. Pinto, V. Severo, and F. Madeiro, “Optimizing Multi‐View cnn for Cad Mechanical Model Classification: An Evaluation of Pruning and Quantization Techniques,” Electronics 14, no. 5 (2025): 1013, https://doi.org/10.3390/electronics14051013.
H. Xu, F. He, L. Fan, and J. Bai, “D3advm: A Direct 3d Adversarial Sample Attack Inside Mesh Data,” Computer Aided Geometric Design 97 (2022): 102122, https://doi.org/10.1016/j.cagd.2022.102122.
Y. Zhou and O. Tuzel, “Voxelnet: End‐To‐End Learning for Point Cloud Based 3d Object Detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018), 4490–4499.
C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space,” in Advances in Neural Information Processing Systems, (2017), 5105–5114.
Y. Song, F. He, Y. Duan, Y. Liang, and X. Yan, “A Kernel Correlation‐Based Approach to Adaptively Acquire Local Features for Learning 3d Point Clouds,” Computer‐Aided Design 146 (2022): 103196, https://doi.org/10.1016/j.cad.2022.103196.
Y. Shi, L. Ma, J. Li, X. Wang, and Y. Yang, “A Region Feature Fusion Network for Point Cloud and Image to Detect 3d Object,” IET Collaborative Intelligent Manufacturing 6, no. 2 (2024): e12100, https://doi.org/10.1049/cim2.12100.
X. Wei, R. Yu, and J. Sun, “View‐gcn: View‐Based Graph Convolutional Network for 3d Shape Analysis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), 1850–1859.
J.‐C. Su, M. Gadelha, R. Wang, and S. Maji, “A Deeper Look at 3d Shape Classifiers,” in Proceedings of the European Conference on Computer Vision Workshops, (2018), 0.
B. Vogel‐Heuser and M. Zou, “Leveraging Inconsistency Management in the Multi‐View Collaborative Modelling of Cyber‐Physical Production Systems,” IET Collaborative Intelligent Manufacturing 1, no. 4 (2019): 126–129, https://doi.org/10.1049/iet‐cim.2019.0019.
H. Su, S. Maji, E. Kalogerakis, and E. Learned‐Miller, “Multi‐View Convolutional Neural Networks for 3d Shape Recognition,” in Proceedings of the IEEE International Conference on Computer Vision, (2015), 945–953.
A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention Is All You Need,” Advances in Neural Information Processing Systems 30 (2017).
Y. Ganin, S. Bartunov, Y. Li, E. Keller, and S. Saliceti, “Computer‐Aided Design as Language,” Advances in Neural Information Processing Systems 34 (2021): 58850–58897.
H. Wang, M. Liu, and W. Shen, “Industrial‐Generative Pre‐Trained Transformer for Intelligent Manufacturing Systems,” IET Collaborative Intelligent Manufacturing 5, no. 2 (2023): e12078, https://doi.org/10.1049/cim2.12078.
Y. Gong, A. Rouditchenko, A. H. Liu, et al., “Contrastive Audio‐Visual Masked Autoencoder,” arXiv preprint arXiv:2210.07839 (2022).
X. Zhang, W. Sun, K. Chen, and R. Jiang, “A Multimodal Expert System for the Intelligent Monitoring and Maintenance of Transformers Enhanced by Multimodal Language Large Model Fine‐Tuning and Digital Twins,” IET Collaborative Intelligent Manufacturing 6, no. 4 (2024): e70007, https://doi.org/10.1049/cim2.70007.
X. Xu, K. D. Willis, J. G. Lambourne, C.‐Y. Cheng, P. K. Jayaraman, and Y. Furukawa, “Skexgen: Autoregressive Generation of Cad Construction Sequences With Disentangled Codebooks,” in International Conference on Machine Learning, (2022), 24698–24724.
R. Wu, C. Xiao, and C. Zheng, “Deepcad: A Deep Generative Network for Computer‐Aided Design Models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), 6772–6782.
H. Cai, Y. Dong, M. Zhu, P. Hu, H. Hu, and L. Jiang, “Intelligent Method Framework for 3d Surface Manufacturing in Cloud‐Edge Collaboration Architecture,” IET Collaborative Intelligent Manufacturing 6, no. 3 (2024): e12115, https://doi.org/10.1049/cim2.12115.
R. Fan, F. He, Y. Liu, Y. Song, L. Fan, and X. Yan, “A Parametric and Feature‐Based Cad Dataset to Support Human‐Computer Interaction for Advanced 3d Shape Learning,” Integrated Computer‐Aided Engineering 32, no. 1 (2025): 73–94, https://doi.org/10.3233/ica‐240744.
R. Fan, F. He, Y. Liu, and J. Lin, “A History‐Based Parametric Cad Sketch Dataset With Advanced Engineering Commands,” Computer‐Aided Design 182 (May 2025): 103848, https://doi.org/10.1016/j.cad.2025.103848.
K. D. Willis, Y. Pu, J. Luo, et al., “Fusion 360 Gallery: A Dataset and Environment for Programmatic Cad Construction From Human Design Sequences,” ACM Transactions on Graphics 40, no. 4 (2021): 1–24, https://doi.org/10.1145/3450626.3459818.
K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large‐Scale Image Recognition,” arXiv preprint arXiv:1409.1556 (2014).
S. Bhojanapalli, C. Yun, A. S. Rawat, S. Reddi, and S. Kumar, “Low‐Rank Bottleneck in Multi‐Head Attention Models,” in International Conference on Machine Learning, (2020), 864–873.
S. Koch, A. Matveev, Z. Jiang, et al., “Abc: A Big Cad Model Dataset for Geometric Deep Learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), 9601–9611.
Z. Wu, S. Song, A. Khosla, et al., “3d Shapenets: A Deep Representation for Volumetric Shapes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2015), 1912–1920.
A. X. Chang, T. Funkhouser, L. Guibas, et al., “Shapenet: An Information‐Rich 3d Model Repository,” arXiv preprint arXiv:1512.03012 (2015).
W. Tang, F. He, Y. Liu, and Y. Duan, “Matr: Multimodal Medical Image Fusion via Multiscale Adaptive Transformer,” IEEE Transactions on Image Processing 31 (2022): 5134–5149, https://doi.org/10.1109/tip.2022.3193288.
T. Si, F. He, Z. Zhang, and Y. Duan, “Hybrid Contrastive Learning for Unsupervised Person Re‐Identification,” IEEE Transactions on Multimedia 25 (2023): 4323–4334, https://doi.org/10.1109/tmm.2022.3174414.