Engineering design is undergoing a transformative shift with the advent of AI, marking a new era in how we approach product, system, and service planning. Large language models have demonstrated impressive capabilities in enabling this shift. Yet, with text as their only input modality, they cannot leverage the large body of visual artifacts that engineers have used for centuries and are accustomed to. This gap is addressed with the release of multimodal vision-language models (VLMs), such as GPT-4V, enabling AI to impact many more types of tasks. Our work presents a comprehensive evaluation of VLMs across a spectrum of engineering design tasks, categorized into four main areas: Conceptual Design, System-Level and Detailed Design, Manufacturing and Inspection, and Engineering Education Tasks. Specifically in this paper, we assess the capabilities of two VLMs, GPT-4V and LLaVA 1.6 34B, in design tasks such as sketch similarity analysis, CAD generation, topology optimization, manufacturability assessment, and engineering textbook problems. Through this structured evaluation, we not only explore VLMs’ proficiency in handling complex design challenges but also identify their limitations in complex engineering design applications. Our research establishes a foundation for future assessments of vision language models. It also contributes a set of benchmark testing datasets, with more than 1000 queries, for ongoing advancements and applications in this field.
Introduction
Large language models (LLMs) have shown promising performance in domains ranging from medicine (Arkoudas 2023), to law (Katz et al. 2023), to mathematics and coding (Bubeck et al. 2023). The chat-like interfaces offered by tools such as Google’s Bard (Manyika and Hsiao 2023) or OpenAI’s ChatGPT (OpenAI 2023) have enabled millions of users to query and leverage their immense implicit knowledge to assist in tasks ranging from diagnosing diseases to creating manufacturing drawings (Makatura et al. 2024), to supporting the conceptual design stage for a robotic arm (Stella et al. 2023). Their use of natural language as the input modality offers an intuitive interface for humans to express a variety of problems, often in their mother tongue and without the need for technical jargon (Bubeck et al. 2023). The diversity of possibilities is immense and a review identified many tasks within engineering design that could be automated using natural language processing tools (Siddharth et al. 2022). Examples include automated design knowledge retrieval, discovery of user needs, and documentation generation.
However, expressing some tasks in text alone is prohibitively complex. In certain domains, especially those involving spatial information, a picture is truly worth one thousand words. Imagine if one had to write assembly instructions for IKEA furniture using text instructions alone. One would have to describe each part, define its orientation, and instruct where to put screws repeatedly using text similar to “Insert screw A in the third hole starting from the top, where the top of the board is the shortest edge that has a slot through its length.” Unsurprisingly, engineers have relied on visual artifacts to communicate and support their work for centuries (Henderson 1999). From sketches that highlight the key working principles of mechanisms to manufacturing drawings that elicit all the information needed to manufacture, assemble, and inspect parts, visual representations are ubiquitous in engineering.
Recently, powerful multimodal language models have been proposed. In particular, text-and-vision models like GPT-4V (OpenAI 2023), LLaVA 1.6 34B (Liu et al. 2023b, 2024), and Fuyu-8B (Bavishi et al. 2023) have shown immense promise since their public release. These vision-language models (VLMs) can take images and text as input and generate text as output. Specifically, most VLMs are based on an LLM and extend its capabilities by adding the embeddings from a vision encoder as tokens in the LLM’s input context. For example, GPT-4V builds upon the leading LLM, GPT-4. Researchers have begun exploring the capabilities of VLMs in several application domains. Examples include image understanding and reasoning (Yang et al. 2023b), image and language association (Liu et al. 2023a), and optical character recognition (Shi et al. 2023).
To better understand LLMs’ capabilities, researchers have investigated the performance of LLMs on standardized tests in specific fields such as law (Katz et al. 2023), or medicine (Nori et al. 2023; Rosoł et al. 2023; Takagi et al. 2023), or on specific skills, such as reasoning or mathematics (Arkoudas 2023; Bubeck et al. 2023). However, there is interest in evaluating LLMs and VLMs on their performance on tasks outside of existing standardized exams. Doing this requires creating assessment questions and answers that accurately reflect the task of interest. Feng et al. (2024) evaluate GPT-4 on an upper-level undergraduate computer science course. The authors compiled and classified a dataset of past assessment questions, and released their benchmark dataset. Other researchers have focused on evaluating the applicability of LLMs through their chat interface for specific topics, including design for manufacturing (Makatura et al. 2024), design for additive manufacturing (Badini et al. 2023), and the design of cementitious composites (Cai et al. 2024).
One challenge in accurately comparing different LLMs or VLMs is that most evaluation research examines a select few models on a specific task with a specific dataset and metric. If models’ performance is not compared on the same tasks with the same datasets using the same metrics, it is difficult to draw accurate conclusions about how the models compare. To address this, Chang et al. (2024) survey current LLM research to understand how and where LLMs are being evaluated. They compile 46 popular benchmarks ranging from general language tasks to domain-specific question-answering tasks. Further, Chang et al. outline the major areas where LLMs have been evaluated thus far: natural language processing tasks; robustness, ethics, biases, and trustworthiness; social sciences; natural science and engineering; and medical applications. Chang et al. identify nearly 100 publications in these categories. They point out, however, that there remains a lack of standardization even within each category, and that this can be improved by the development of standardized benchmarks and automatic evaluation.
The tradeoff between automatic and human evaluation is a persistent challenge in this field (Chang et al. 2024; Zheng et al. 2023; Zhu et al. 2024). For certain tasks involving human preferences, automatic evaluation techniques may be insufficient; however, human ratings are expensive and hard to scale. Researchers are exploring the use of LLMs as judges, focusing on their agreement with human evaluations and identifying their biases (Zheng et al. 2023). New standardized and automatic benchmarks are also being developed to address this. PromptBench, for example, is a unified Python library for evaluating LLMs with modular options for different LLMs, tasks, datasets, benchmarks, prompts, and prompt engineering techniques (Zhu et al. 2024). Still, many of these benchmarks and tasks are developed for language-only models.
Looking to evaluate vision-language models rather than language-only models, MMBench (Liu et al. 2023c) is a work that specifically addresses benchmarking VLMs via “an objective evaluation pipeline of 2,974 multiple-choice questions covering 20 ability dimensions.” This work highlights the importance of developing VLM benchmarks that are relevant to a given field. However, the 20 ability dimensions in Liu et al. (2023c) are quite general, and not specialized to engineering design tasks.
In this work, we aim to create an evaluation methodology including curated tasks, datasets, and benchmarks to answer the question “Can VLMs effectively perform tasks or assist engineers in engineering design?”
Engineering design encompasses a broad range of tasks within the product design process, as shown in Fig. 1. These include: (1) generating and selecting concepts, (2) choosing between modular and integral structures, (3) sizing components and selecting materials, and (4) prototyping, manufacturing, and inspection.
Objectives and contributions This work is a preliminary and broad exploration of the capabilities of VLMs for engineering design tasks that require textual and visual inputs, and it seeks to establish standardized evaluation tasks. We select diverse engineering tasks that could be automated by VLMs and perform qualitative and quantitative analyses of GPT-4V, summarized in Fig. 1. We then use our quantitative benchmark to evaluate a leading open-source VLM, LLaVA 1.6 34B (Liu et al. 2024). We discuss our findings and implications for using VLMs within engineering design. We aimed, wherever possible, for larger sample sizes and quantitative analyses, despite the lack of an API at the start of our investigation. Specifically, our overall contributions are:
We developed and performed quantitative experiments through more than 1000 queries to evaluate GPT-4V and create benchmarks for future VLMs related to (i) design similarity, (ii) early-stage sketch description, (iii) understanding of engineering drawing and CAD generation, (iv) topology optimization result analysis, (v) assessment of manufacturability, (vi) machining feature identification, (vii) defect identification, (viii) textbook problems, and (ix) spatial reasoning.
We developed and performed qualitative case studies of GPT-4V’s performance related to (i) early-stage sketch description generation, (ii) material selection, and (iii) topology optimization understanding.
We created and released datasets for future evaluations, including the input images, input prompts, and answers for all eight quantitative experiments described above on the project webpage https://decode.mit.edu/projects/vlms4design/.
We ran comparative experiments using the open-source VLM LLaVA 1.6 34B to demonstrate the use of these tasks and datasets for benchmarking VLMs in engineering design.
[See PDF for image]
Fig. 1
We explored GPT-4V and LLaVA 1.6 34B’s ability to perform numerous engineering design tasks that utilize both visual and textual information. Panel “Textbook Problems” adapted from Frey and Gossard (2009) under CC BY-NC-SA 4.0. Panel “Material Inspection” adapted from Mundt et al. (2019) under its specific license
Outline of the work Section 3 delves into conceptual design, focusing on design sketches and text descriptions. Section 4 focuses on tasks related to the detailed design stage. In particular, we discuss material selection, engineering drawing analysis, and CAD generation. We also use the specific case of topology optimization to assess the model’s general understanding and the support it can offer for result analysis. Section 5 assesses the general knowledge of manufacturing and tests the performance on inspection tasks. Section 6 investigates the performance of GPT-4V on textbook problems, providing some insights into its over-arching engineering knowledge and skills. Section 7 investigates the performance of GPT-4V on spatial reasoning tests, looking into an important skill across design tasks. Finally, Sect. 8 compares the quantitative results from GPT-4V and LLaVA 1.6 34B, and Sect. 9 offers a broader perspective on the capabilities of VLMs for engineering design and discusses their limitations.
Experimental setup
We developed most of our prompts within the ChatGPT user interface, specifically harnessing the capabilities of the September 25, 2023 update. It should be noted that the principal content of the paper and the primary experiments were conducted using this particular version,1 with a specific focus on the vision feature. We do not include internet access, plugins, or a code interpreter unless explicitly mentioned (for example see Sect. 4.3.1). By the nature of using the ChatGPT web interface, we do not control or vary the “temperature” or “top-k” parameters of the responses. Each repeated experiment was run in a new chat, which we denote as context in this work. When using the GPT-4V API and LLaVA 1.6 34B we used a temperature and top-k of 1.0.
Our methodology is outlined by the following key points:
Emphasis on images. Our primary focus in utilizing VLMs is for image-based tasks. This is where our efforts and resources are concentrated, aiming to explore the model’s understanding and analysis of visual data.
Short text prompts. We utilize short, straightforward prompts that prioritize vision tasks. This is to ensure a focus on visual analysis over the complexity of prompt engineering or creating lengthy custom instructions.
Transparent prompts and answers. In each section, we provide examples of our exact prompts and the exact response from the model, unless shortened with ellipses.
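To make this setup concrete, the sketch below illustrates how a single image-plus-text query of the kind used throughout this work could be issued programmatically through the OpenAI Python SDK. It is a minimal illustration only: our primary experiments were conducted through the ChatGPT web interface, and the model name, prompt, and file path shown here are placeholder assumptions rather than the exact configuration we used.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def query_vlm(image_path: str, prompt: str) -> str:
    """Send one image plus a short text prompt as a single, fresh-context query."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder GPT-4V model name
        temperature=1.0,               # matches the API setting described above
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example usage (hypothetical file and prompt):
# print(query_vlm("design_sketch.png", "Please describe this design in 3 sentences."))
```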
Conceptual design
Overview and motivation Conceptual design is an early stage of the product development process, during which designers identify customer needs, establish target specifications, and then generate, select, and test many concepts (Ulrich et al. 2020). Experts estimate that 70–80% of a product’s life-cycle costs are determined during the early design stages (Corbett and Crookall 1986; Pahl et al. 2007), highlighting the importance of decisions made during this stage. There exists a rich body of research examining concept generation during this stage (Bryant et al. 2005; Toh and Miller 2019; Das and Yang 2022), creativity evaluation in conceptual design (Amabile 1982; Shah et al. 2003; John and Sharon 2009; Baer and Kaufman 2019; Cseh and Jeffries 2019), cost estimates and environmental assessments during conceptual design (Saravi et al. 2008; Fu et al. 2020), and, recently, multimodal machine learning in conceptual design (Yuan et al. 2021; Edwards et al. 2021; Song et al. 2023a; Su et al. 2023).
In conceptual design, two of the primary design modalities are hand-drawn sketches and textual descriptions. Often these modalities are combined, and early-stage designs are represented as a sketch with accompanying text. A vision language model (VLM) such as GPT-4V has great potential to be used as a tool during conceptual design. Some of the main tasks in conceptual design include generating sketches and descriptions of design ideas, comparing and iterating upon those ideas, and ultimately selecting a design to move forward with. In the following experiments, we explore how GPT-4V can act as an aide for these tasks. Specifically, we see how the multimodal capabilities enable GPT-4V to perform engineering tasks when both design sketches and text prompts are included as input. For effective multimodal learning, it is important to have designs with both image and text modalities, and sufficient datasets of these designs. There exist a few multimodal conceptual design datasets that we use in the following experiments. However, an overarching theme in machine learning within the engineering domain is that most datasets are small, which poses a challenge for data-driven models. Large pretrained models like GPT-4V can help overcome this challenge because they have been trained on a plethora of information (although the exact training information of GPT-4V is not yet released), meaning that they have general knowledge about the world. Another challenge is faced during concept selection. Experts suggest generating a multitude of conceptual designs and then down-selecting through design evaluation (Ulrich et al. 2020); however, the evaluation step is often performed by experts, which takes time and resources (Baer and Kaufman 2019). GPT-4V may be able to help engineers during the conceptual design stage by utilizing general knowledge and sketch understanding to interpret and compare designs, move between design representations (text and image), and perform concept selection tasks. We aim to evaluate these capabilities in the following experiments.
Design similarity
Determining if two designs are similar is an important part of conceptual design. Assessing design similarity can act as a proxy for assessing design novelty, which is a common criterion in concept selection (Ahmed et al. 2018). Novelty expresses that a concept is rare, ingenious, imaginative, or surprising, as well as radical or transformational (Verhaegen et al. 2012). However, novelty evaluation is often subjective. It can be easier for humans to articulate why they might rate designs as similar than why they would rate one design as more novel than another (Ahmed et al. 2018). For this reason, past work has studied how humans assess similarity of concepts (Ahmed et al. 2018) as a method to build idea maps and identify novelty. Similarity comparisons can also help explore the design space by identifying clusters of similar ideas, potentially helping with faster design space exploration. Recently, researchers have compared how human evaluations of similarity compare to computationally determined similarity, and found that they diverge based on the level of abstraction of a product (Nandy and Goucher-Lambert 2022).
Three main challenges arise with human evaluations of similarity:
Evaluation speed and cost Human evaluations are very expensive. Both the time and the cost of these evaluations are exacerbated as the number of designs increases since the number of similarity queries scales with the number of designs.
Self-consistency Humans may make different similarity assessments when they repeat the same evaluation.
Transitive violations Given designs A, B, and C, one cannot consistently claim that B is more similar to A than C is, that C is more similar to B than A is, and that A is more similar to C than B is. This would violate the transitive property of inequality: if AB < AC (where AB denotes the perceived distance, or dissimilarity, between designs A and B) and BC < BA, then these imply CB < CA, so claiming CA < CB cannot be true. A violation of this sort can be tested when the same three designs, which we call a triplet, are assessed for similarity multiple times, with each design in turn serving as the reference.
Methodology We tasked GPT-4V with performing the same experiment that eleven human raters performed in Ahmed et al. (2018). We use 10 early-stage design sketches from Starkey et al. (2016) and Toh and Miller (2016), as shown in Fig. 2. Each design is defined by a hand-drawn sketch, possibly annotated, and a handwritten description, referred to as its “text description.” We present the designs in groups of three, which we call triplets. As shown in Fig. 3, we provide a triplet of design sketches labeled A, B, and C, and ask GPT-4V which design is most similar to A. Since we have 10 designs, we can form 360 triplets such that each design serves as design A for all 36 combinations of the other 9 designs as designs B and C. Across these 360 triplet queries, we assessed whether GPT-4V commits transitive violations. We then repeated 50 of the queries to evaluate GPT-4V’s self-consistency.
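The bookkeeping behind this experiment is simple enough to sketch in code. Assuming each answer is recorded against its reference design and candidate pair, the snippet below enumerates the 360 triplet queries and counts cycle-type inconsistencies among the three queries that share the same three designs; it reflects the distance-based reading of a transitive violation given above and is an illustrative reconstruction, not the exact scoring script of Ahmed et al. (2018).

```python
from itertools import combinations

designs = list(range(10))  # the ten milk-frother sketches

# All 360 (A, B, C) triplet queries: each design serves as reference A once for
# every unordered pair {B, C} drawn from the remaining nine designs.
triplets = [(a, b, c) for a in designs
            for b, c in combinations([d for d in designs if d != a], 2)]
assert len(triplets) == 360

def count_transitive_violations(answers):
    """answers maps (reference, frozenset({b, c})) -> the design judged most
    similar to the reference.  A violation occurs when the three answers for a
    set {X, Y, Z} form a cycle (X picks Y, Y picks Z, Z picks X, or the reverse),
    which no underlying set of pairwise distances can satisfy."""
    violations = 0
    for x, y, z in combinations(designs, 3):
        picks = {a: answers[(a, frozenset({b, c}))]
                 for a, b, c in ((x, y, z), (y, x, z), (z, x, y))}
        # all three picks distinct  <=>  the choices form a 3-cycle
        if picks[x] != picks[y] != picks[z] != picks[x]:
            violations += 1
    return violations
```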
[See PDF for image]
Fig. 2
Ten conceptual designs of novel milk frothers. We task GPT-4V with assessing the similarity of these designs to one another. The handwritten descriptions at the bottom of each design after the typed “Idea Description:” are referred to as “text descriptions” throughout this work. For example, the text description of design 0 is “countertop jet engine”
[See PDF for image]
Fig. 3
Assess design similarity
In addition, we utilize the generalized non-metric multidimensional scaling (GNMDS) technique introduced by Agarwal et al. (2007) to find 2-D embeddings of the design sketches from the triplet responses and generate a visualization of the ten designs in which designs that are closer to each other are considered more similar. This is the same technique used in Ahmed et al. (2018) to generate a map of these designs from human ratings. The resultant map, which we refer to as GPT-4V’s idea map, is shown in Fig. 4. The idea map shows a striking resemblance to human-generated idea maps reported in the literature. We observe that the three designs that show cups filled with milk are grouped together (designs 2, 5, and 6), as are the two bicycle-based designs (designs 3 and 4). This clustering of similar designs was also observed in the idea map of all human raters combined, shown in Fig. 8 of Ahmed et al. (2018). GPT-4V’s idea map also places Sketch 0 furthest from all other sketches, denoting that it was perceived as the most dissimilar. Coincidentally, the most novel sketch identified by the aggregated human ratings was also Sketch 0. Sketch 0, which proposes a countertop jet turbine to froth milk, was also rated the most novel sketch by the expert in their work. This serves as a validity check, demonstrating that GPT-4V’s assessment of sketch similarity aligns with human ratings. We note that in past work each human rater produced a different map, and GPT-4V likewise creates a unique map. The variability in individual human idea maps is likely influenced by diverse criteria for judging similarity; consequently, establishing a definitive standard for sketch similarity is challenging. We therefore compared our results to the map aggregated over eleven human raters to gauge how GPT-4V’s assessments conform with collective human wisdom.
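For readers who wish to reproduce an idea map from their own triplet responses, the following is a minimal gradient-descent analogue of a GNMDS-style embedding. It is not the solver of Agarwal et al. (2007); the squared-distance hinge loss, margin, and hyperparameters are illustrative assumptions, but the resulting 2-D coordinates can be plotted in the same way as Fig. 4.

```python
import numpy as np

def triplet_embedding(triplet_answers, n_items=10, dim=2, lr=0.05,
                      margin=1.0, reg=0.01, n_iters=2000, seed=0):
    """Embed items in `dim` dimensions so that, for each (anchor, winner, loser)
    answer, the winner ends up closer to the anchor than the loser does."""
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=0.1, size=(n_items, dim))
    for _ in range(n_iters):
        grad = reg * X                      # small ridge term keeps the map bounded
        for a, w, l in triplet_answers:
            d_aw = X[a] - X[w]
            d_al = X[a] - X[l]
            # hinge on squared distances: want ||a-w||^2 + margin <= ||a-l||^2
            if d_aw @ d_aw + margin > d_al @ d_al:
                grad[a] += 2 * (d_aw - d_al)
                grad[w] += -2 * d_aw
                grad[l] += 2 * d_al
        X -= lr * grad
    return X  # (n_items, dim) coordinates; closer points were judged more similar
```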
These initial findings pave the way for future research avenues. First, GPT-4V’s capability to create idea maps is not only scalable and well suited to evaluating a large number of design items, but also overcomes limitations faced by previous studies reliant on time-consuming human ratings. Second, the use of triplet queries for generating idea maps extends beyond sketches to other design forms like prototypes, 3D models, or multimedia, offering a novel approach to evaluating design similarity. These maps are valuable tools for designers, enabling them to better understand their design domain and leverage GPT-4V for more effective exploration of the design space.
[See PDF for image]
Fig. 4
A map of the milk frother design sketches where sketches that are closer to each other are more similar. These are based on the responses by GPT-4V for 360 triplet similarity queries. We observe that the map clusters similar designs together and places unique designs further away from other designs
Discussion: design similarity From the quantitative experiments using a total of 410 queries, with results summarized in Table 1, we find that GPT-4V is able to assess the similarity of designs with performance greater than or equal to that of human raters. In the 360 trials, GPT-4V made only five transitive violations, which matches the lowest number of transitive violations made by any of the eleven human raters. Additionally, in our trials, GPT-4V was self-consistent 94% of the time, which is greater than any of the human raters. A naive model could still be self-consistent without actually understanding much about a design sketch, so to gain insight into GPT-4V’s similarity assessment, we plot the designs using the GNMDS embedding technique, shown in Fig. 4. We observe sensible clustering of three design sketches whose major features are cups with milk, and two design sketches whose major features are bicycles. While future work should explore how well VLMs perform at this task for other datasets, these results offer a promising suggestion that GPT-4V can effectively assess the similarity of conceptual design sketches. Future VLMs can be evaluated using these same methods to compare them against this version of GPT-4V; we provide the dataset, including all triplets, for this purpose. These experiments were meant to test understanding, analysis, and evaluation. The results suggest that GPT-4V is able to understand and analyze design sketches in order to assess their similarity. Furthermore, assessing similarity among design triplets is a mundane and repetitive task. GPT-4V’s ability to perform this task may mean that human raters do not have to, which can save time and resources both in dataset creation and in the design process.
Table 1. A summary of the self-consistency and number of transitive violations of GPT-4V (highlighted in bold) when evaluating 360 triplets to determine which designs are more similar
| Rater | Self-consistency (%) | Transitive violations |
|---|---|---|
| 1 | 91.6 | 5 |
| 2 | 50.0 | 5 |
| 3 | 83.3 | 5 |
| 4 | 75.0 | 10 |
| 5 | 58.3 | 10 |
| 6 | 58.3 | 20 |
| 7 | 41.6 | 8 |
| 8 | 41.6 | 20 |
| 9 | 58.3 | 11 |
| 10 | 75.0 | 12 |
| 11 | 58.3 | 5 |
| **GPT-4V** | **94.0** | **5** |
We compare the results with the corresponding values for 11 human raters reported in Ahmed et al. (2018). GPT-4V has higher self-consistency than the human raters and its number of transitive violations equals the lowest human rater value
Design descriptions
Through these experiments, we aim to evaluate how well GPT-4V understands the different representations of a design, in this case, textual and sketch. Understanding sketches is a first step in being able to evaluate and compare them, which is one of the end goals of the conceptual design phase (Ulrich et al. 2020). We task GPT-4V with matching a design to its correct description given a number of options, and we also task GPT-4V with generating textual descriptions of a design given just a sketch. We specifically chose to perform description matching (Sect. 3.2.1) in the form of multiple-choice questions because this allows for quantitative analysis. Additionally, we provide the exact questions and results so that future VLMs can be similarly evaluated. In fact, we evaluate both GPT-4V and LLaVA 1.6 34B in this manner; results are shown in Tables 2 and 20. While these description-matching and description-generation tasks do not directly translate to common engineering tasks during the design process, we believe that GPT-4V’s performance on them sheds light on its ability to understand information from one modality (sketch) and then synthesize information in another modality (generating text). Furthermore, if a tool can automatically generate accurate and useful textual descriptions of conceptual design sketches, this could allow engineers to (1) create an easily searchable catalog of early-stage designs, and (2) more easily generate multimodal datasets of paired sketches and text descriptions, which are necessary for multimodal machine learning in the engineering domain (Song et al. 2023b). Automatically generating relevant textual descriptions for hand-drawn sketches can also help communicate design ideas to design team members and potential stakeholders, which is a primary role of sketching (Das and Yang 2022). Lastly, it can help human raters judge design ideas for creativity, novelty, quality, and other common design metrics (Shah et al. 2003).
Match a description to a design
Given an image of an early-stage design sketch, and four different design description options, we test if GPT-4V can identify the correct description. We analyze GPT-4V’s performance on these simple tasks to gain a basic understanding of whether more challenging description generation tasks are possible. Our methodology, four examples of our prompts, and a table of our results are provided below.
Methodology We assessed if GPT-4V can match a design sketch to its correct text description for three different cases:
We provide the whole image including the handwritten text description, as well as four description options including “None of the above” (Fig. 5 upper left).
We provide the image with the handwritten text description removed, as well as four description options including “None of the above” (Fig. 5 upper right).
We provide the image with the handwritten text description removed, and only three text description options, removing the “None of the above” option (Fig. 5 lower left and lower right).
[See PDF for image]
Fig. 5
Match a design to the correct description
Table 2. Results for the three multiple-choice design description matching experiments
| Design | Correct answer | With text description | No text description | No text description, no “None of the above” |
|---|---|---|---|---|
| 0 | B | B | B | B |
| 1 | A | A | A | A |
| 2 | C | C | D | B |
| 3 | A | A | A | A |
| 4 | B | B | B | B |
| 5 | C | C | D | B |
| 6 | C | C | D | C |
| 7 | C | C | D | D |
| 8 | B | B | A | A |
| 9 | B | B | B | B |
| Trial 1 score | | 10/10 | 5/10 | 6/10 |
| Trial 2 score | | 10/10 | 5/10 | 8/10 |
| Trial 3 score | | 10/10 | 6/10 | 7/10 |
| Average | | 100% | 53.3% | 70% |
90 queries were run across three trials, 30 queries for each of the three cases. The full results for Trial 1 are displayed, as well as the scores for all three trials
GPT-4V perfectly matches designs to their descriptions when the handwritten description is provided in the sketch. When the handwritten description is removed, GPT-4V’s errors are typically a result of selecting “None of the above,” and when that option is removed, and it is forced to select a description, its performance increases (from 5.33/10 to 7/10). These results, combined with the results of LLaVA 1.6 (see Table 20), suggest that the presence of textual context in sketches significantly enhances model accuracy, underscoring the importance of integrating text with visual data. The variability in model performance, particularly GPT-4V’s superiority in scenarios without handwritten descriptions, suggests that model choice should be tailored to specific task requirements. Additionally, GPT-4V’s tendency to select “None of the above” when uncertain highlights its cautious approach in ambiguous situations, reflecting a strategy to manage uncertainty. This behavior, along with the contrasting error patterns between GPT-4V and LLaVA 1.6, points to the need for deeper understanding and improvement in how different models process and interpret visual information, especially in the absence of textual cues. These findings are crucial for optimizing the use of VLMs in conceptual design contexts.
Generate a design description from a sketch
Given an image of an early-stage design sketch, can GPT-4V generate a relevant and accurate design description?
We performed this experiment for five early-stage design sketches with varying drawing scores. The drawing scores are based on a Consensual Assessment Technique evaluation of students’ milk frother designs (Starkey et al. 2016; Toh and Miller 2016). The scores can range from 1–7, but within the dataset of sketches, the scores range from 1–6. Table 3 shows the results for these design sketches. The selected sketches were chosen at random from among all sketches with a similar drawing score. We show how GPT-4V responds when simply asked to describe the design, versus when provided with a description of the original design task given to the students and then asked to describe the design. For brevity, in both cases, we prompt GPT-4V to respond in three sentences. The exact prompts are provided in the column headers of Table 3.
Table 3. GPT-4V generated descriptions from design sketches
[See PDF for image]
We also include the expert-rated drawing score of each sketch. The sketches are ordered by descending drawing score
Discussion: design descriptions In this section, we aimed to assess GPT-4V’s ability to match different representations of an early-stage conceptual design and to generate one representation from another. The tasks we generated to explore these capabilities were: match a design sketch with its correct textual description, and generate a textual design description from a sketch. For each of these tasks, we gave various forms of the sketch to understand how the amount of handwritten text and the drawing skill in each sketch would affect these results. The quantitative results from the description matching experiment, shown in Table 2, provided a basic understanding of whether the later description generation tasks were possible at all. The results showed that given an entire design sketch including the handwritten text description, GPT-4V was able to match the sketch to the text description for 10/10 of the questions across all three trials. This result essentially just assured us that GPT-4V could comprehend the hand-written text in these drawings.
With this point verified, we next tested description matching if we removed the handwritten text description from the image. In this case, GPT-4V was able to match the sketch to the text description for 5.33/10 of the questions on average. While this is still better than random chance, which would be an average of 2.5/10, this result demonstrated how important providing both modalities, text, and image, is in this design stage. We noticed that many of the incorrect answers were GPT-4V selecting the “None of the above” option, suggesting that none of the descriptions matched the design. In fact, 4/5 of the incorrect answers for Trial 1 and 2, and 3/4 incorrect answers for Trial 3 occurred this way. This is sensible to us, as the design sketches are often visually simple compared to their textual descriptions. An example of this is design 5 in Fig. 2, which visually looks like a cup of milk, but whose text description is “Centrifuge of milk.” With these results in mind, we tested how well GPT-4V would match the sketch to the description if we removed the “None of the above” option. This led to an improved average of 7/10 correct. One interesting example is shown in the lower right image of Fig. 5, in which GPT-4V generated its own option because the model determined that the sketch did not match any of the provided options, despite one being correct.
We further explored GPT-4V’s capability to generate textual descriptions from design sketches. The results are shown in Table 3. We explore how well the model generates design descriptions for three designs with different “drawing scores,” or levels of perceived drawing ability. For each design, we task GPT-4V with generating a description using two different prompts, one with more information:
Please describe this design in 3 sentences.
A student was asked to develop a new innovative product that froths milk in a short amount of time. Please describe this design in 3 sentences.
[See PDF for image]
Fig. 6
Generate a design description
Concept selection
A core component of conceptual design is concept selection (Okudan and Tauhid 2009; Miller et al. 2021). There are various concept selection methods, ranging from those based on decision matrices, to uncertainty modeling, to heuristics (Okudan and Tauhid 2009). One of the most widely used concept selection methods for engineers is the Pugh Evaluation Matrix, sometimes called the Pugh chart (Pugh 1991, 1995).
A Pugh Chart, also known as a Pugh Matrix, is a decision-making tool used in engineering and design. It involves comparing multiple options against a set of criteria, using a baseline for reference, to determine the most suitable choice. Each option is scored based on how well it meets each criterion compared to the baseline, facilitating an objective evaluation of alternatives. The first step in creating a Pugh Chart is defining selection criteria, which will be used to evaluate and compare concepts. The method may vary, but a common practice is to then select a benchmark design, and score all other designs qualitatively based on how they compare to the benchmark for each of the selection criteria.
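As a small illustration of the scoring mechanics described above, the snippet below tabulates and sums a Pugh chart; the concepts, criteria, and relative scores are entirely hypothetical and are not taken from our experiments.

```python
import pandas as pd

# Each concept is scored against a baseline concept per criterion:
# +1 (better than baseline), 0 (same), -1 (worse).
criteria = ["Ease of handling", "Dose metering accuracy", "Durability", "Portability"]
scores = pd.DataFrame(
    {"Concept B": [+1, 0, -1, +1],
     "Concept C": [0, +1, 0, -1],
     "Concept D": [-1, +1, +1, 0]},
    index=criteria,
)
scores.loc["Net score"] = scores.sum()  # concepts with the highest net score advance
print(scores)
```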
In the following experiments, we explore GPT-4V’s ability to both generate selection criteria given a design prompt, and use concept selection methods on design sketches. We utilize a case study presented in the Concept Selection chapter of Product Design and Development by Ulrich et al. (2020). This case study provides a design task:
“A medical supply company retained a product design firm to develop a reusable syringe with precise dosage control for outpatient use.”
Generating selection criteria
In Fig. 7, we explore whether GPT-4V can generate selection criteria that would be used to evaluate designs, given a description of a design task. As a baseline to assess the generated criteria, we utilized the selection criteria and Pugh Chart provided in Ulrich et al. (2020); these were made for this design task and used the same seven design concepts that we used. Table 4 shows the ground-truth selection criteria from the textbook and the matching selection criteria generated by GPT-4V (if applicable). As seen in Table 4, GPT-4V generated a matching selection criterion for each of the ground-truth criteria, sometimes via multiple criteria. For example, for the ground-truth criterion “Ease of manufacture,” the two matching criteria generated by GPT-4V are “Cleaning and Sterilization: Ease of disassembly and reassembly for cleaning” and “Scalability: Design should be adaptable for mass production.”
[See PDF for image]
Fig. 7
Generating selection criteria for a design task
Table 4. Selection criteria provided by the Product Design and Development textbook and the GPT-4V equivalent
| Textbook selection criteria | GPT-4V equivalent |
|---|---|
| Ease of handling | Ease of use: Ergonomic grip and handling |
| Ease of use | Ease of use |
| Readability of dose settings | Ease of use: Clear markings and indicators for dosage |
| Dose metering accuracy | Precision and accuracy |
| Durability | Durability and reusability |
| Ease of manufacture | Cleaning and sterilization: Ease of disassembly and reassembly for cleaning; Scalability: Design should be adaptable for mass production |
| Portability | Storage and portability |
Creating a Pugh Chart
In Fig. 8, we explore GPT-4V’s ability to perform concept selection by generating a Pugh Chart. In particular, given selection criteria and images of various designs, can GPT-4V analyze and evaluate the designs and format this evaluation in a Pugh Chart? The results are discussed in the following section.
[See PDF for image]
Fig. 8
Generating a Pugh Chart for concept selection
Discussion: concept selection Through these experiments, we explored GPT-4V’s ability to perform two common concept selection tasks: generating selection criteria given a design task and making a Pugh chart given several design concepts. We found that GPT-4V was able to assess a design task and generate many relevant selection criteria, as shown in Fig. 7. For example, given this design task “A medical supply company retained a product design firm to develop a reusable syringe with precise dosage control for outpatient use,” GPT-4V generated selection criteria such as “Safety and Biocompatibility,” “Ease of Use,” and “Precision and Accuracy.” These criteria highlight that the design must be user-centered and safe in healthcare settings.
Table 4 shows the baseline selection criteria provided for the design task from Ulrich et al. (2020), as well as the equivalent selection criteria generated by GPT-4V, if applicable. We observe that for each of the seven criteria that the baseline Pugh Chart used, GPT-4V outputs an equivalent criterion. It is important to note that some of GPT-4V’s equivalent criteria were subcategories, such as GPT-4V’s “Ease of Use: Clear markings and indicators for dosage,” which mapped to the textbook’s “Readability of dose settings.” These results demonstrate that GPT-4V is able to generate many relevant selection criteria, but that an engineer should still read the raw output and select relevant criteria as well as separate subcategories of certain criteria.
When tasked with creating a Pugh Chart given selection criteria and several designs, we found that GPT-4V understood what a Pugh Chart was and how to generate one; however, it was often reluctant to create one given the limited information, as shown in Fig. 8. GPT-4V was able to generate an empty Pugh Chart with the correct matrix format (not shown in Fig. 8, which switched the typical rows and columns), and also understood that it would be filled with qualitative comparisons of concepts, with one reference concept. However, it would only fill in the Pugh chart with hypothetical values given the lack of information about each concept. For example, in one trial, GPT-4V stated “Since I cannot physically interact with the concepts to evaluate them against the criteria you’ve provided, such as “Ease of handling” or “Durability”, I am unable to make informed decisions about the scores. These criteria often require subjective assessments or empirical testing that cannot be performed without direct interaction with the designs.” In every instance of running the task in Fig. 8, GPT-4V refused to fill in the Pugh chart with anything other than random hypothetical values. Perhaps if an engineer provided more information about each concept, GPT-4V would have been able to generate an accurate Pugh chart; however, it failed to do so within our task format.
Overall, our findings suggest that GPT-4V can be potentially effective in assisting human designers in identifying key factors that should be considered in the design process. However, while GPT-4V can generate equivalent criteria to those used in traditional methods, its outputs may need refinement, such as categorizing subcriteria. In terms of creating a Pugh chart, GPT-4V understands the concept and can format the chart correctly, but its reluctance to fill in the chart without extensive information indicates a limitation. This suggests that while GPT-4V can be a useful tool for structuring and initiating the concept selection process, human input remains crucial for detailed analysis and decision-making. The implications for practitioners are clear: VLMs such as GPT-4V can be a valuable aid in the initial stages of design concept evaluation, but they may require careful oversight and additional information to realize their full potential in more complex decision-making tasks.
Summary
A concise summary of our assessment areas for GPT-4V’s conceptual design abilities, and our findings for each are provided below.
Assessing design similarity Section 3.1—How does GPT-4V’s consistency in assessing design sketch similarity compare to human benchmarks?
We measure consistency using two measures from Ahmed et al. (2018)—self-consistency and transitive violations in assessing sketch triplet queries. The model is able to assess design similarity with higher self-consistency than human raters (94% compared to 62.8% average for human raters) and as few transitive violations as the top human raters.
Matching design representations Section 3.2.1—Can GPT-4V accurately match design sketches with their text descriptions under varying information conditions?
We ran 90 queries: three trials, each comprising ten multiple-choice questions for each of three different cases. When provided with the entire design sketch including the handwritten description, the model matched a design sketch to its appropriate text description 10/10 times in all three trials. With the handwritten description removed, however, the score dropped to an average of 5.33/10, compared to an expected 2.5/10 for random matching. Incorrect answers were often in the form of choosing “None of the above,” so when given the same task without the “None of the above” option, the score rose to an average of 7/10.
Generating design descriptions Section 3.2.2—Is GPT-4V capable of generating effective descriptions for early-stage design sketches?
Qualitatively, we find that GPT-4V is able to generate accurate and useful design descriptions given hand-drawn sketches.
Generating selection criteria Section 3.3.1—How effectively does GPT-4V generate concept selection criteria in engineering design?
In our case study, we find that when provided a design task GPT-4V generates useful selection criteria that match those generated by design professionals.
Generating a Pugh Chart Section 3.3.2—What is the extent and limitation of GPT-4V’s ability to generate Pugh charts for conceptual design evaluation?
GPT-4V understands what a Pugh chart is and can provide examples of the formatting, but it often will not fill in the Pugh chart, or will only provide a “looks-like” Pugh chart, given just a design task and design sketches. The model states that it cannot fill in the Pugh chart without additional context about the designs, suggesting it may be able to do so if provided with more information.
System-level and detail design
Overview and motivation After conceptual design, engineers move into the system-level and detail design portion of the product development process (Ulrich et al. 2020). During this phase, engineers flesh out the specifics of the design as they select materials, develop CAD models, iterate to optimize the design, abide by constraints, and create engineering drawings for subsequent manufacturing steps. For example, an experienced mechanical engineer, embarking on the design of a new lightweight bicycle frame, would necessarily go through a material selection phase—to identify materials that are both strong and light—a CAD generation phase—to develop the geometric details of the design—and a design optimization phase—to refine the design to best meet a set of objectives. These steps are very common and critical to the detail design stage and inform the structure of this section.
The generation of system-level and detail designs draws upon many skill sets, including spatial reasoning, knowledge of specific CAD software programs, and physics-based principles. Many of the tasks that engineers perform during this step of the product development process are inherently visual. They must consult material charts and graphs, read and create engineering drawings, develop 3D geometric models in CAD GUIs, and visually check for geometric constraint violations. Given VLMs’ image-processing capabilities and emergent technical knowledge, we test, in the following sections, the models’ abilities to support visual-based detail design tasks.
Material selection
Oftentimes, the selection of material comes early on in the detail design phase, as material choice informs both the design and the manufacturing method used. Material selection requires balancing various constraints and requirements, such as material strength, stiffness, cost, density, embodied energy, electrical resistivity, and thermal conductivity (Ashby 2016). Choosing a material that meets an extensive list of requirements and constraints often requires cross-referencing multiple tables or charts, such as Ashby charts (Ashby 2016). Ashby charts enable engineers to visually represent the trade-offs between various material properties for different families of materials. Provided with these charts for different material properties, LLMs have the potential to condense material information and identify materials that meet certain criteria. Several groups have explored GPT’s ability to assist with material considerations. In Saka et al. (2023), the authors used the GPT API to integrate ChatGPT into the building information modeling process to assist with material selection for components of a building. In Makatura et al. (2024), the authors looked at GPT-4’s ability to propose a manufacturing process for a part based on the selected material. In Buehler (2023), the authors trained a model, MeLM (Mechanics Language Model), which was used for material-related tasks, like proposing microstructure designs that meet certain stress–strain responses. In this section, we conduct three independent experiments involving Ashby charts and material selection. To analyze the consistency of responses, each experiment is repeated three times.
Methodology For the material selection experiments, we provide GPT-4V with various Ashby charts, which are commonly used by engineers to evaluate trade-offs between different materials. We conduct three different types of experiments: Ashby chart look-up, Ashby chart cross-referencing, and Ashby chart material selection for a beam. Each experiment type is repeated three times. One of the repetitions of each of the three experiment types can be seen in Fig. 9 (look-up), Fig. 10 (cross-referencing), and Fig. 11 (material selection for a beam).
[See PDF for image]
Fig. 9
Ashby chart look-up: Identifying materials that meet stiffness and density constraints
[See PDF for image]
Fig. 10
Ashby chart cross-referencing: Identifying materials that meet stiffness, strength, and density constraints
[See PDF for image]
Fig. 11
Application of Ashby charts to the material selection for a beam
For the look-up experiment, we provide the model with a density vs. Young’s modulus Ashby chart. We ask the model to identify materials that have a density between 7 and 10 Mg/m³ and a Young’s modulus greater than 100 GPa. The purpose of this experiment is to assess whether GPT-4V can perform a simple “look-up” of feasible materials from the chart.
For the cross-referencing experiment, we give GPT-4V two Ashby charts, one showing density vs. Young’s modulus and another showing density vs. strength. We then ask GPT-4V to cross-reference the two charts, identifying materials that have a density between 1 and 3 Mg/m³, a Young’s modulus between 0.01 and 0.1 GPa, and a strength of 3 MPa. The purpose of this experiment is to understand whether GPT-4V can synthesize information from two material charts together.
For the material selection for a beam experiment, we provide the model a density vs. Young’s modulus Ashby chart. We ask GPT-4V to help us select a material for a hypothetical beam, given general requirements that the beam must be both stiff and light. The purpose of this experiment is to understand if GPT-4V can translate the general requirements into material requirements and propose appropriate material families based on those requirements.
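For reference, the expected answers to a query like the look-up experiment can be checked against a small table of approximate material properties, as in the sketch below. The property ranges here are rough illustrative values and are not data read from the specific Ashby charts supplied to the model.

```python
import pandas as pd

# Approximate property ranges: density in Mg/m^3, Young's modulus E in GPa.
# Values are rough, for illustration only.
materials = pd.DataFrame(
    {"density_lo": [7.6, 8.3, 8.3, 4.4, 14.0, 0.03],
     "density_hi": [8.1, 8.9, 9.0, 4.8, 15.5, 0.30],
     "E_lo":       [190, 180, 110, 100, 500, 0.001],
     "E_hi":       [215, 220, 150, 120, 700, 0.010]},
    index=["Steels", "Ni-alloys", "Cu-alloys", "Ti-alloys", "WC-Co", "Polymer foams"],
)

# Look-up query: density between 7 and 10, Young's modulus greater than 100 GPa.
feasible = materials[(materials.density_lo >= 7) & (materials.density_hi <= 10)
                     & (materials.E_lo > 100)]
print(feasible.index.tolist())  # expected: Steels, Ni-alloys, Cu-alloys
```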
Discussion: material selection Overall, we conclude that GPT-4V performs well at identifying broad material families that exhibit general properties (e.g., low density), but performs less well when given specific requirements or constraints (e.g., a density between 1.0 and 3.0 Mg/m³). This finding is illustrated by the results of the three experiments. The responses from all experiments and repetitions can be seen in Table 5. One repetition of each experiment is displayed in full in Figs. 9, 10, and 11. For the Ashby Chart Look-Up experiment (see Fig. 9), we would expect the correct answer—materials that have a density between 7 and 10 Mg/m³ and a Young’s modulus greater than 100 GPa—to be steels, Ni-alloys, and Cu-alloys (Zn-alloys and Mo-alloys lie on the border of the feasible region). For all three repetitions, GPT-4V correctly answered that steels would be feasible materials. Two out of three times, it also mentioned that Ni-alloys would meet the specified requirements. However, in all three iterations it also included materials in its answer—either Ti-alloys or WC-Co—that do not meet our specifications; Ti-alloys have a density less than 7 Mg/m³ and WC-Co materials have a density greater than 10 Mg/m³. In none of the iterations did GPT-4V mention Cu-alloys in its answer, although this material group meets both the stiffness and density specifications.
Table 5. Summarized results for the GPT-4V material selection experiments
[See PDF for image]
*For these responses, GPT-4V expressed concern about making definitive material choices based on the resolution of the provided Ashby charts
GPT-4V performed poorly in our Ashby Chart Cross-Referencing Experiment (see Fig. 10 for the full response from one of the repetitions). The correct answers to the question were the soft butyl and elastomer materials, which have densities between 1.0 and 3.0 Mg/m³, Young’s moduli between 0.01 and 0.1 GPa, and a strength of 3 MPa. Across the three repetitions, GPT-4V never identified either of these materials as meeting our requirements. Overall, it tends to conclude that polymer foams, foams, polymers, or woods would be suitable choices, but these materials do not meet our specifications. Polymer foams, for example, do not meet the density requirement; many polymer foams have densities between 0.1 and 0.3 Mg/m³, suggesting that GPT-4V possibly confuses our 1.0–3.0 Mg/m³ density specification with this 0.1–0.3 Mg/m³ range. It is important to note that for two of the three repetitions of this experiment, GPT-4V was hesitant to provide an answer due to the “resolution” of the provided images.
Our Ashby Chart Look-Up and Cross-Referencing experiments reveal areas where improvement is needed in handling precise numerical data and in synthesizing complex information. The model’s struggle to accurately interpret numerical constraints, as evidenced by these two experiments, highlights a shortfall in applying exact numerical ranges. Furthermore, its inability to effectively cross-reference and synthesize data from multiple sources underscores a challenge in processing multi-dimensional information. This issue is particularly pertinent in engineering, where precision and multi-faceted data analysis are crucial.
GPT-4V performs much better when asked to propose potential material families for a beam that needs to be both lightweight and stiff (see Fig. 11). For all three experiment repetitions, GPT-4V correctly translates the stiffness specification into a high Young’s modulus requirement and the light-weight specification into a low-density requirement. For all trials, it correctly asserts that materials we would want to consider are towards the top left of the provided Ashby chart and proposes engineering composites and engineering alloys (and for two of the three repetitions, also proposes engineering ceramics and wood products). We conclude that while GPT-4V struggles to identify materials that meet specific numerical requirements, it is much better at proposing material families that meet general specifications.
Overall, the use of GPT-4V for material selection in engineering design showcases its potential as a supportive tool in the preliminary stages of decision-making and as an educational aid in materials science. Its ability to suggest material families based on general requirements can streamline the initial phases of design, allowing engineers to focus on finer details. This integration points towards a future where AI complements traditional engineering tools, enhancing the efficiency of design workflows. However, it also raises important ethical and practical considerations, such as over-reliance on VLMs without knowing their limitations, and the need to ensure that AI-generated material recommendations align with safety standards and environmental concerns. For example, the varying answers across the repetitions highlight the non-deterministic nature of ChatGPT’s interface, which engineers need to consider.
Transitioning from the exploration of GPT-4V’s capabilities in material selection, the research now shifts focus to another critical aspect of engineering design: VLM’s ability to interpret complex engineering drawings and contribute to the generation of Computer-Aided Design (CAD) models.
Engineering drawing analysis and CAD generation
A critical step of the detailed design process is the generation of 3D models. Computer-aided design (CAD) software enables the creation of detailed solid models, allowing engineers to encode precise part dimensions and assembly relations between parts (Nyemba 2022). These CAD models pave the way for design for manufacturing, since detailed engineering drawings with manufacturing specifications are typically created from the 3D models (Nyemba 2022). CAD models are also useful for the different ways in which designs and parts can be visualized (e.g. cross-sections, wireframe views, etc.), enabling engineers to easily consider different aspects of their design (Nyemba 2022). We hypothesized that GPT-4 with vision would be better able to assist with CAD generation and engineering drawing analysis than GPT-4 since these two design forms—CAD models and engineering drawings—are inherently visual mediums.
We draw inspiration from the work of researchers who have explored the potential of GPT to assist with converting text into CAD (Makatura et al. 2024; Nelson et al. 2023). For example, Makatura et al. devoted a large section of their work to exploring GPT-4’s ability to generate CAD designs from text. They examined converting text into scripts that generate 2D designs (DXFs and SVGs), demonstrating relative success in the design of 2D pieces of a cabinet. The same work then performed several case studies to illustrate GPT-4’s ability to convert text into scripts for 3D designs, using both CSG-based and sketch-based CAD languages (Makatura et al. 2024). These experiments showed mixed success, often requiring prompts to be engineered with specific function signatures. The authors noted reasoning challenges, particularly with spatial reasoning. They also cited iterative ability as both a capability and a drawback of GPT-4: they found the model was sometimes able to correct errors through continued chat iterations, but that more iterations also led to limited memory of previous information in a chat. However, a key limitation of past work is that it relied on text-only LLMs, while CAD is inherently a task with significant visual aspects. In our study, we focus on evaluating the capabilities of VLMs.
Methodology To assess GPT-4V’s ability to analyze engineering drawings and generate CAD, we utilized a seven-prompt experiment structure (carried out in the same context window). An example of a full experiment can be seen in Fig. 12. The first two prompts (P1 and P2) of each experiment assess GPT-4V’s ability to analyze engineering drawings. We test the model on two aspects: 1) its ability to describe a part based on an engineering drawing and 2) its ability to extract dimensions from an engineering drawing. We correct any responses to these questions that the model answers incorrectly, so we can independently score the next part of the experiment. In prompts three through seven (P3–P7), we evaluate the model’s ability to generate a script that encodes the CAD of a part. We ask the model to do this based on the previously provided engineering drawing, the previously extracted part dimensions, and a CAD scripting language that we specify. In this part of the experiment, we score the model on the CAD that its script generates. If the CAD is not correct on the first attempt, we feed it back views of the generated CAD and ask it to iterate to fix any discrepancies it sees between the generated CAD and the original engineering drawing. In this way, our feedback is much like the model’s built-in Code Interpreter, but instead for CAD. We repeat this iterative process until five different CAD generation attempts have been made.
In total, we ran nine experiments, each with seven prompts (P1-P7) conducted sequentially in a single chat context. Three groups of three experiments used identical prompts (conducted for repetition), and the difference between the three groups of experiments is in the CAD scripting language specified. We now further elaborate on the experiment structure and the method for scoring each experiment:
The prompts for these queries are identical across all nine experiments.
Part description from an engineering drawing—Prompt 1 (P1). GPT-4V is given an engineering drawing of a block with a blind hole, as seen in Fig. 12, P1. We ask the model to provide a description of the part. We chose to use this block-with-blind-hole part as the subject of our experiments since it represents one of the most basic yet functional parts that can be created using CAD, necessitating only two sketches and basic cut/extrude operations. The drawing follows typical engineering drawing conventions and was created in an undergraduate-level engineering course.3 As such, it is a drawing that we would expect undergraduate-level mechanical engineers to readily understand.
Scoring (1 point possible): We assign 1 point if GPT-4V correctly mentions that the part is a “block with a hole” or “block with a blind hole.” Any mention of a “through” hole receives no points as it shows an incorrect understanding of the underlying geometry.
Dimension extraction from an engineering drawing—Prompt 2 (P2). We next ask GPT-4V to extract the dimensions shown in the engineering drawing, assigning them appropriate names. We specifically ask GPT-4V to not extrapolate to create dimensions that are not explicitly called out in the drawing.
Scoring (10 points possible): 1 point is awarded for each of the five numbers shown on the drawing (8.00, 5.00, 12.00, 4.00, and ø5.00) that GPT-4V successfully extracts. Another point is awarded for each of the five dimensions to which it assigns an appropriate name. For the block dimensions—8.00, 5.00, and 12.00—we accept any assignment of [length, width, height] or [depth, width, height] to the three dimensions since the assignment of these labels depends on how the block is oriented. For the 4.00 dimension and the ø5.00 dimension, we expect labels of “hole depth” and “hole diameter” respectively, or equivalent names. For any dimensions GPT-4V lists beyond those shown in the drawing, we subtract 1 point for not following instructions.
CAD Generation 1—Prompt 3 (P3). Continuing in the same context window where P2 left off, we correct any dimensions that GPT-4V extracted from the drawing incorrectly, and we then ask GPT-4V to generate a CAD script of the block-with-hole part based on the engineering drawing provided in P1 and the dimensions it extracted in P2. For three of the experiments (experiments 1–3), we ask GPT-4V to do this using the CadQuery scripting language; for another three experiments (experiments 4–6), we ask GPT-4V to do this with a different scripting language, FeatureScript; and for the last three experiments (experiments 7–9), we ask GPT-4V to use the CAD scripting language OpenSCAD. Note that each language offers unique features and advantages:
CadQuery: An open-source CAD scripting module built in Python, CadQuery is easy to use for those already familiar with Python.
FeatureScript: The scripting language of the CAD software Onshape—a free cloud-based CAD software—FeatureScript is integrated into Onshape, enabling both traditional CAD modeling and custom, script-defined, parametric modeling.
OpenSCAD: Another open-source CAD scripting language built in C++, OpenSCAD is integrated into the CAD software FreeCAD and provides granular control over a model.
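For reference, the target part requires only a few lines of code in any of these languages. The snippet below is a minimal sketch in CadQuery (our own illustration, not a model output), using the dimensions called out in the drawing; the assignment of the 12.00, 8.00, and 5.00 values to the x, y, and z axes and the choice of the top face for the hole are assumptions made for illustration.

```python
# Minimal CadQuery sketch of the block-with-blind-hole part (illustrative, not a
# model output). Block: 12.00 x 8.00 x 5.00; blind hole: diameter 5.00, depth 4.00,
# centered on the largest face. The axis assignment is an assumption.
import cadquery as cq

part = (
    cq.Workplane("XY")
    .box(12.00, 8.00, 5.00)       # block dimensions from the drawing
    .faces(">Z")                  # largest face of the block (12.00 x 8.00)
    .workplane()
    .hole(5.00, depth=4.00)       # blind hole: diameter 5.00, depth 4.00
)

cq.exporters.export(part, "block_with_blind_hole.step")
```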
Scoring (6 points possible): We assign 1 point if the generated script has no errors when run. We award 1 point for each of the following features that the generated CAD possesses: the block has correct dimensions; the CAD has a hole on the largest block face; the hole is centered on a face; the hole has the correct depth; and the hole has the correct diameter. We subtract a point for each extra, incorrect feature that is present in the generated CAD (e.g. a second hole, a cutout in the block, etc.).
CAD Generation 2–5: Prompts 4–7 (P4–P7). If the CAD generated by the previous prompt has a syntax error when the code is run, we provide it to GPT-4V and ask it to fix the script. If the script runs but the generated CAD doesn’t have a perfect score, we ask GPT-4V to correct discrepancies between the generated CAD and the engineering drawing. We ask it to do this by providing it with an image with four views of the CAD generated from the previously provided script (see 4.2.1 P4 for an example). These views show the CAD with hidden lines and coordinate systems visible for each view. If the CAD is still not receiving a perfect score by P7 (CAD Generation 5), we end the experiment.
Scoring (6 points possible for each prompt, P4–P7): Scoring for these prompts is identical to the scoring for P3.
[See PDF for image]
Fig. 12
Overview of the prompts and answers for one experiment of engineering drawing analysis and CAD generation using OpenSCAD. Only the first three CAD generation iterations are shown
Discussion: engineering drawing analysis and CAD generation Based on the results from P1 and P2 (see Table 6), which quantify GPT-4V’s ability to analyze an engineering drawing, we conclude that the model generally understands the content in the drawing, but struggles with interpreting the drawing’s details. For P1, in eight out of the nine experiments, GPT-4V incorrectly describes the part as a block with a hole “through” it. While it understands the part generally—a block with a hole—it does not pick up on the notation in the drawing that indicates the hole is blind rather than through. In the one experiment (experiment 4) where it received a correct score for the part description, it called the part a “rectangular block with a cylindrical hole or recess in it” and a “generic block with a hole.” While this qualifies as an accurate description, it does not demonstrate whether GPT-4V recognizes the blind hole in the drawing or not.
Table 6. Summarized results from Sect. 4.2
Experiment name | Exp 1 | Exp 2 | Exp 3 | Exp 4 | Exp 5 | Exp 6 | Exp 7 | Exp 8 | Exp 9 |
|---|---|---|---|---|---|---|---|---|---|
(P1) Part Description | 0/1 | 0/1 | 0/1 | 1/1 | 0/1 | 0/1 | 0/1 | 0/1 | 0/1 |
8.0 Dim. extraction and label | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1a
5.0 Dim. extraction and label | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2
12.0 Dim. extraction and label | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2
ø5.0 Dim. extraction and label | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2
4.0 Dim. extraction and label | 1a | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1a
Additional dim. extraction | 0 | − 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
(P2) Dimension extraction | 9/10 | 9/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 10/10 | 8/10 |
aEntries with a score of 1 correspond to successful dimension extraction but an incorrect label assignment
After being told that the part in question is a block with a blind hole (P2), GPT-4V is generally good at extracting dimensions from the drawing, receiving a perfect P2 score for six out of nine experiments (see Table 6). Across all nine experiments, GPT-4V always extracts all five dimensions from the drawing. Two-thirds of the time it assigns all of the dimensions appropriate labels. It has the most trouble naming the hole depth dimension: in experiment 1 it calls it the “height of the block (from the bottom right view)” and in experiment 9 it calls it the “width of the block.” The relative difficulty in understanding what the 4.0 dimension represents in the drawing is consistent with its initial lack of understanding (in P1) that the drawing represents a block with a blind hole. It is also interesting to note that GPT-4V is inconsistent in the labels it chooses for the three block dimensions—varying between height/width/length, depth/height/width, height/depth/width, and depth/height/length—perhaps reflecting a lack of consistent spatial reasoning or a lack of consistent norms used to label dimensions.
From the responses to P3–P7, we evaluate GPT-4V’s ability to generate CAD using CAD scripting languages. We observe (see Fig. 13) that GPT-4V rarely generates accurate CAD on the first attempt (P3) and that the CAD iterations (P4–P7) do not improve the result. For P3, only one of the nine experiments (experiment 3, using CadQuery) leads to correctly generated CAD on the first attempt. For FeatureScript, GPT-4V never gets past syntax and function implementation errors across all five CAD generation iterations. The most common issue for P3 is not placing the hole on the correct face. We noticed this is because the hole extrusion direction is always linked to the dimension to which GPT-4V assigns the “height” label. The 5.0 block dimension is assigned the height label only three times, one of which is the sole experiment (experiment 3) where perfect-score CAD is generated in P3.
[See PDF for image]
Fig. 13
Results of the CAD Generation prompts, CAD Generation 1—CAD Generation 5 (P3–P7). Experiments 1–3 were generated using CadQuery and Experiments 7–9 were generated using OpenSCAD. Experiment 3 was the only experiment that generated perfect score CAD on the first iteration. Experiments 4–6, the FeatureScript experiments, are not shown here, since they had persistent code errors and never generated viable CAD
From the results of P4–P7, we conclude that visual feedback on the design generated in the previous prompt does not improve GPT-4V’s CAD scripting ability. In fact, when GPT-4V generates incorrect CAD in P3, P4–P7 never fully rectify the problematic CAD, and CAD Generation 5 (P7) ends with a worse score than CAD Generation 1 (P3). A visualization of this finding can be seen in Fig. 13. For the CadQuery and OpenSCAD experiments, a general reduction in CAD score occurs in CAD Generation 3 (P5), where GPT-4V consistently forgets the dimensions it extracted from the original engineering drawing.
In summary, we find that GPT-4V can pick up many aspects of the provided engineering drawing (e.g. the general part depicted, many of the dimensions shown, etc.), but can struggle when it comes to understanding the details (e.g. recognizing that the hole is blind, labeling the hole depth dimension, etc.). GPT-4V performs poorly when it comes to CAD generation, and we demonstrate that our attempts at visual, iterative improvements are unsuccessful. These findings imply that while GPT-4V can offer some assistance in preliminary design tasks, its current capabilities are not yet sufficient for detailed, precision-driven CAD work. Building on these findings, future research should focus on enhancing the model’s ability to interpret and process detailed engineering information. Another critical area for development is CAD generation, where GPT-4V currently shows limitations. Future work should explore methods to improve the model’s accuracy and efficiency in creating detailed CAD models, perhaps through advanced training techniques or integration with specialized CAD software. Additionally, there is a need to investigate how iterative feedback mechanisms can be better utilized by GPT-4V to make meaningful corrections and improvements in successive design iterations. Addressing these areas will be crucial in expanding the applicability of GPT-4V and similar VLM tools in more advanced and precision-dependent stages of the engineering design process.
In tandem with CAD generation and engineering drawing creation, engineers frequently aim to improve the design of the part, using iterative optimization approaches (e.g., using commercial tools such as nTop or SOLIDWORKS Simulation software). One of the commonly used iterative optimization approaches is structural topology optimization, which can help a designer reduce material usage while meeting some design requirements. In the next section, we turn to GPT-4V’s ability to assist with topology optimization.
Topology optimization analysis
Structural topology optimization (TO) is a numerical approach for optimizing material distribution in structures under specific constraints, aiming for efficient material use while maintaining performance. The SIMP (Solid Isotropic Material with Penalization) method, a dominant approach in TO, models material properties using a density field, adjusting it iteratively to optimize the design and adhere to stress or deformation constraints (Bendsøe 1989). In mechanical systems, the minimum compliance problem focuses on finding a material density distribution, $\boldsymbol{\rho}$, that minimizes deformation under given forces and boundary conditions (Bendsøe and Kikuchi 1988). The problem is formulated as:

$$
\begin{aligned}
\min_{\boldsymbol{\rho}} \quad & c(\boldsymbol{\rho}) = \mathbf{F}^{\mathsf{T}}\mathbf{u}(\boldsymbol{\rho}) \\
\text{subject to} \quad & \mathbf{K}(\boldsymbol{\rho})\,\mathbf{u}(\boldsymbol{\rho}) = \mathbf{F}, \\
& V(\boldsymbol{\rho})/V_{0} \le f, \\
& 0 \le \boldsymbol{\rho} \le 1
\end{aligned}
\tag{1}
$$

Here, the objective is to minimize the compliance $c(\boldsymbol{\rho}) = \mathbf{F}^{\mathsf{T}}\mathbf{u}$, with $\mathbf{F}$ being the external load forces, $\mathbf{u}$ the nodal displacements obtained as the solution of the equilibrium equation $\mathbf{K}(\boldsymbol{\rho})\mathbf{u} = \mathbf{F}$, and $\mathbf{K}(\boldsymbol{\rho})$ the stiffness matrix, which depends on the material distribution. The constraints include maintaining the volume fraction below a specified limit $f$ and ensuring the design variables remain within the bounds of 0 and 1, allowing for a gradient of material distribution from void to solid (Bendsøe and Kikuchi 1988; Sigmund and Maute 2013). Optimal topologies are, however, often challenging for human experts to analyze. The topologies that result from the optimization process may be mathematically optimal, but present practical challenges in manufacturability and analysis and can be non-intuitive for human designers. We consider the analysis of topology optimization images as a test case to evaluate the use of vision-language models to assist humans in interpreting complex topologies. Note that in these experiments we do not ask the VLM to perform topology optimization, but rather to determine the inputs to or interpret the results of topology optimizations.
Volume fraction estimation
In this experiment, we task the model with calculating the volume fraction from an optimized topology depicted in an image. This involves measuring the proportion of black material in the given domain and determining the relevant ratio. The challenge is initially approached using GPT-4V’s visual analysis capabilities alone. Following this, GPT-4V employs its code interpretation abilities to address the task. We aim to obtain an accurate answer within a 5% error threshold.
Methodology In this experiment, we provide a VLM with an image of an optimized topology, and prompt the VLM to estimate the volume fraction of the structure. We provide the following contextual information: “Consider that white means the absence of material and the initial domain is a square of size 256x256,” as shown in the prompt in Fig. 14. We perform this experiment for 100 optimized designs with no floating material as well as 50 non-optimized designs which have floating material. We run the experiment with three different prompting strategies, as seen in Table 7: without domain-specific prompting (w/o Expertise), incorporating domain expertise in the prompt (w/ Expertise, i.e. “you are an expert engineer in the field of topology optimization”), and enriching the expert prompt with a Chain-of-Thought rationale (w/ CoT, i.e. “answer using a step-by-step rationale”). The quantitative results are shown in Table 7.
Table 7. Volume Fraction Error (VFE) is the percent error when predicting the volume fraction of an optimized design
Prompting strategy | VFE (%) | FME (%) | FME (50/50) (%) |
|---|---|---|---|
w/o Expertise | 45.71 ± 2.45 | 47.15 ± 1.11 | 42.08 ± 2.83 |
w/ Expertise | 44.13 ± 1.11 | 44.71 ± 1.15 | 39.75 ± 2.54 |
w/ CoT | 43.62 ± 0.77 | 43.90 ± 0.23 | 39.50 ± 2.49 |
Floating Material Error is for the classification task of determining if a design has floating material or not (50% error would be a random guess)
We provide quantitative experiments in Table 7, where the aim is to estimate Volume Fraction (VF) and the presence of Floating Material (FM). Table 7 presents the error metrics across 100 optimized topologies (first and second columns) and a balanced set of 50 optimized alongside 50 un-optimized topologies with randomized floating material (third column). It focuses on Volume Fraction Error (VFE) for assessing the accuracy in estimating material usage and Floating Material Error (FME) for identifying disconnected components in designs, with a baseline error expectation set at 50% for random chance. We can compare our three prompting strategies (w/o Expertise, w/ Expertise, and w/ CoT). Integrating engineering expertise into the model’s prompting strategy (w/ Expertise) improves the results, as seen by lower VFE than w/o Expertise in Table 7. However, the addition of step-by-step explanations (CoT) yields only marginal improvements in accuracy. Overall, the capacity to estimate Volume Fraction and detect Floating Material is relatively poor. This observation underlines the limitations of relying solely on a vision encoder for precise topology optimization tasks, such as estimating Volume Fraction and detecting Floating Material. It highlights the necessity of employing external analytical tools, like a code interpreter or dedicated vision modules, to achieve more accurate and reliable outcomes.
[See PDF for image]
Fig. 14
Volume fraction estimation
Discussion: volume fraction estimation
In Fig. 14, we challenge the model to quantitatively estimate the volume fraction of an optimized topology. The model accurately defines the task as “evaluating the percentage of the black area (material presence) relative to the total area of the square.” However, its initial attempts to count black pixels and calculate the material percentage yield highly inaccurate results. This inconsistency persists across multiple trials, each providing different and incorrect answers.
To address these limitations, we introduce a code interpreter (third prompt in Fig. 14), enabling the model to use a Python script for the estimation. This approach significantly improves accuracy, bringing the estimate close to the target within a reasonable margin of error. This experiment highlights two key insights: Firstly, it underscores the limitations of the vision encoder (at least for the version of GPT-4V used in this study) in handling precise quantitative assessments based on images. Secondly, it demonstrates the effectiveness of integrating coding tools in overcoming these limitations, showcasing the synergistic potential of combining AI’s interpretive capabilities with precise, code-based calculations for more accurate and reliable results.
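For illustration, the kind of script the code interpreter can run for this estimate is very short. The sketch below is our own, assuming the topology image is saved as a grayscale PNG; the file name and the mid-gray threshold of 128 are assumptions.

```python
# Minimal sketch of a code-interpreter-style volume fraction estimate.
# Assumes black pixels denote material and white pixels denote void;
# the file name and the threshold of 128 are illustrative assumptions.
import numpy as np
from PIL import Image

img = np.array(Image.open("optimized_topology.png").convert("L"))  # grayscale array
material = img < 128                  # True where material (dark pixels) is present
volume_fraction = material.mean()     # material pixels / total pixels in the domain
print(f"Estimated volume fraction: {volume_fraction:.3f}")
```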
Technical caption generation and analysis
Methodology In this experiment, we task a VLM with captioning a technical diagram using a basic prompt, with the diagram inspired by the experiment in Fig. 7 from Woldseth et al. (2022). Initially, we employ a generic prompt for captioning. Then, we enhance the task by incorporating details about the system’s technical expertise, providing a more in-depth and knowledgeable description of the diagram. These can be seen in Fig. 15. We would like to see that the model understands that small variations of constraint configurations (in this case load direction) can greatly change the optimized topology.
Additionally, we conducted a quantitative experiment to assess how effectively a VLM processes textual information within images, such as descriptive captions (Table 8). We created ten problem setups modeled after the example in Fig. 15, varying aspects of the image such as the number of elements in the x and y axes, volume fraction, load magnitude, and angle of application, along with the positions and directions of loads and the types of boundary conditions.
[See PDF for image]
Fig. 15
Technical captioning of TO images
Our focus was to evaluate GPT-4V’s ability to extract specific details from captions–information that exists solely as text within the image. We also tested the model’s accuracy in identifying the positions and directions of loads, as well as the types and placements of boundary conditions. Additionally, we examined the model’s capability to generate a functional topology optimization Python script based on the given data.
The findings indicate that GPT-4V performs well in interpreting and retrieving data from image captions. However, it shows limitations in accurately locating loads and boundary conditions. The Python scripts generated by the model were structurally sound but typically included minor mistakes in indexing and variable names, reflecting areas needing improvement in the model’s programming language generation capabilities.
Table 8. This table presents whether the model was successful (a score of 1) or not (a score of 0) at caption analysis, loads and boundary positioning, and code validation when analyzing a topology optimization image
Problem | Caption analysis | | | | | | Position | | Validation |
|---|---|---|---|---|---|---|---|---|---|
| nelx | nely | |F| | VF | R | θ | Load | BC | Code Runs |
P 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
P 2 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
P 3 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
P 4 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
P 5 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
P 6 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
P 7 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
P 8 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
P 9 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
P 10 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
Avg | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.4 | 0.4 | 0.0 |
For caption analysis, the terms nelx and nely are the numbers of elements in the x-direction and y-direction, respectively; |F| is the load magnitude; VF is the volume fraction; R is the filter radius; and θ is the angle of application of the load
Table 8 presents the results from testing the GPT-4V model on ten different problem setups, which are modifications of the example shown in Fig. 15. These modifications include variations in load positions and directions, domain sizes, volume fractions, and load application angles (represented as θ). The table evaluates the model’s performance in three critical areas: (i) Caption Analysis. We evaluate how accurately the model interprets text within images, such as captions. (ii) Load and Boundary Positioning. We evaluate the model’s precision in identifying the correct positions for loads and boundary conditions. (iii) Code Validation for Topology Optimization. We evaluate its effectiveness in generating a usable Python script for minimum compliance topology optimization, highlighting issues like indexing errors or incorrect variable naming which prevent the script from running. Each entry in the table is scored as 1 (correct answer) or 0 (wrong answer, partially correct, or non-functional script), summarizing the model’s capability to handle textual and spatial information in images under varying complexity and setup parameters.
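Since indexing slips were the most common failure in the generated scripts, the sketch below illustrates the kind of bookkeeping involved. It is our own illustration of a standard 2D minimum-compliance setup, not code produced by the model; the grid size, load angle, and column-major node numbering are assumptions.

```python
# Minimal sketch (our illustration, not a model output) of the indexing where
# generated scripts typically slipped: mapping a node's (ix, iy) grid position
# to degree-of-freedom indices in the global load vector of a 2D setup.
import numpy as np

nelx, nely = 128, 64                       # elements in x and y (illustrative values)
ndof = 2 * (nelx + 1) * (nely + 1)         # two DOFs (x, y) per node
F = np.zeros(ndof)

def dof_indices(ix, iy):
    """Return (x-DOF, y-DOF) for the node at column ix, row iy (column-major numbering)."""
    node = ix * (nely + 1) + iy
    return 2 * node, 2 * node + 1

# Apply a unit load at 30 degrees on the mid-height node of the right edge (illustrative).
fx_dof, fy_dof = dof_indices(nelx, nely // 2)
theta = np.deg2rad(30.0)
F[fx_dof], F[fy_dof] = np.cos(theta), -np.sin(theta)

# Fix all DOFs along the left edge (a cantilever-style boundary condition).
fixed = np.array([d for iy in range(nely + 1) for d in dof_indices(0, iy)])
free = np.setdiff1d(np.arange(ndof), fixed)
```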
Discussion: technical caption generation and analysis
In Fig. 15, we evaluate the model’s ability to interpret a complex diagram. The model delivers a comprehensive and accurate analysis, adeptly linking forces, their angles of application, and optimized topology. It accurately identifies the image as “a structural or mechanical analysis, demonstrating how the structure or material responds to varying load angles.” This insight into how loading direction affects topology is a correct deduction of a difficult physical problem, showcasing the model’s proficiency in understanding boundary conditions, loads, and their impact on structures.
However, the model encounters difficulty with the boundary conditions at the bottom center and right of the diagram, mistakenly interpreting them as “two hanging weights,” which is an incorrect assessment of the boundary sketch. This misinterpretation is unexpected, particularly given the overall high-quality response and the accurate grasp of the problem’s essence.
Further refining the prompt to emphasize engineering concepts, the model again provides a largely accurate response, delving deeper into topics like loads, volume fraction, and filtering radius. Yet, it repeats the same error concerning the boundary conditions, suggesting “Two external point loads, represented by weights, are applied at the bottom corners.” This persistent mistake indicates a gap in the model’s global understanding of the scenario, revealing a vulnerability to misconceptions in specific contexts.
Invalid design
Methodology The task involves identifying invalid designs, specifically floating materials, based on a given prompt. The objective is for the model to recognize, independently and without prior information, the presence of a disconnected component within a low-resolution design (64 × 64). Following this recognition, the model is expected to assess the design’s overall validity and the quality of the low-resolution grid. Lastly, we test if the model is capable of suggesting potential improvements to rectify the identified issues.
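For reference, detecting floating material programmatically is a standard connected-component check. The sketch below is our own illustration (not part of the VLM query), assuming the 64 × 64 design is stored as a grayscale image with dark pixels denoting material; the file name and the 128 threshold are assumptions.

```python
# Minimal sketch of a conventional floating-material check via connected components.
# The file name and the 128 threshold are illustrative assumptions.
import numpy as np
from PIL import Image
from scipy import ndimage

img = np.array(Image.open("design_64x64.png").convert("L"))
material = img < 128                            # True where material is present
_, num_components = ndimage.label(material)     # count connected material regions
print("Floating material detected" if num_components > 1 else "Single connected structure")
```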
Discussion: invalid design In Fig. 16, we task the model with identifying the presence of floating material in a design, specifically a detached triangle in the top right corner. The model accurately recognizes the issue as a result of a topology optimization process, correctly noting the disconnection of the “isolated triangular shape from the main structure” in the top right.
[See PDF for image]
Fig. 16
Invalid design identification
When queried about the structure’s validity, the model identifies the floating material but its response lacks full clarity on the implications of such a flaw. While it correctly points out that “this isolated feature might be problematic to manufacture, and its disconnected nature might render it ineffective in a load-bearing role,” it fails to emphatically state that a disconnected component invariably compromises structural integrity and manufacturability. In Table 7, we present a quantitative analysis of our experiment on detecting Floating Material using the vision encoder, illustrating the challenges inherent in accurately identifying the presence or absence of small disconnected components across a broad range of cases.
Regarding the design’s optimization objectives, the model suggests a focus on “minimal material usage.” This is a common requirement in topology optimization, but it oversimplifies the broader range of performance requirements typically involved in such processes.
When asked about improving the design, the model sensibly proposes “integrating the isolated triangle into the main structure or removing it if it lacks functional benefit.” This is a valid solution to address the floating material issue. However, its subsequent recommendation to “re-evaluate boundary conditions and load cases” as a method to eliminate disconnected components is somewhat misguided. The more appropriate approach would involve refined optimization strategies and post-processing techniques.
In summary, while GPT-4V effectively identifies floating material and offers viable solutions, it falls short of fully understanding the criticality of disconnected components. Floating material or disconnected parts in a topology-optimized design invariably render it structurally unsound from an engineering point of view or unmanufacturable without further optimization or processing.
Summary
In order to assess GPT-4V’s performance in select detail design tasks, we performed evaluations on material selection using an Ashby chart, on engineering drawing analysis and CAD generation, and on supporting the understanding and analysis of topology optimization. Our findings are discussed below.
Material selection Section 4.1—Can GPT-4V effectively assist in material selection based on property charts and design requirements?
We find that GPT-4V can be helpful at pointing out material families that meet general specifications, but that it struggles to identify materials that match specific numerical requirements.
Engineering drawing analysis and CAD generation Section 4.2—How accurately can GPT-4V extract and interpret information from engineering drawings? What is GPT-4V’s proficiency in generating and iteratively improving CAD scripts from engineering drawings?
Only 11% of the time was GPT-4V successful in describing a block with a blind hole part. Once told it was a block with a blind hole, GPT-4V was successful 67% of the time in extracting all dimensions from the engineering drawing and assigning the dimensions appropriate names.
Preliminary findings suggest that GPT-4V struggles with CAD generation of the block-with-blind-hole part, as it only succeeded once in nine attempts to generate correct CAD on the first try. Also, its iterative ability for CAD correction appears limited, as it doesn’t successfully correct incorrect CAD in subsequent iterations.
Topology optimization analysis Section 4.3—Can VLMs properly analyze aspects of an optimized topology, such as the volume fraction and the presence of floating material? Furthermore, can VLMs interpret images with technical captions to identify the correct positions for loads and boundary conditions?
While GPT-4V correctly defines the concepts of volume fraction and floating materials in terms of topology optimization, Table 7 shows that GPT-4V has high error in estimating volume fraction or identifying floating material from images.
In terms of analyzing images and captions to identify the number and location of elements like forces and boundary conditions, Table 8 shows that GPT-4V could consistently extract values from captions; however, it could only identify the position of loads and boundary conditions from images 40% of the time. Lastly, GPT-4V never generated executable topology optimization code.
Manufacturing and inspection
Overview and motivation Here we focus on assessing the performance of GPT-4V in manufacturing and inspection-related tasks. Our motivation stems from the visual cues that engineers often use to understand the practical aspects of manufacturing complex geometric artifacts. This multimodal information requires expertise in understanding images as well as manufacturing knowledge. As GPT-4V shows potential for task-specific image analysis, we evaluate its potential for manufacturing and inspection. The field of manufacturing is broad, and discussing the complete potential of multimodal LLMs for all manufacturing tasks is out of the scope of our work. To this end, we focus on selected manufacturing tasks that can provide useful insights in assessing the capabilities of these multimodal LLMs. Specifically, we focus on design for manufacturing (DfM) and post-manufacturing inspection tasks. Both of these topics are critical for manufacturing applications in industry and demand extensive domain-specific knowledge. We draw particular attention to understanding the manufacturability of 3D CAD models from images alone. Note that manufacturability traditionally refers to the relative ease with which a part can be manufactured (Budinoff 2019; Budinoff and McMains 2021; Yang and Zhao 2015). Ensuring the manufacturability of a new part is a major challenge and requires careful analysis and expertise. An automated tool for this purpose could increase manufacturing productivity by a large margin. Multimodal LLMs may help industries build next-generation tools for automating these types of tasks. Our analysis can be thought of as an early evaluation of multimodal LLMs and their manufacturing knowledge and reasoning. We divide the Design for Manufacturing section into two parts: additive and subtractive manufacturing. Based on existing literature, we query GPT-4V with images of 3D CAD models and assess its manufacturability response against the ground truth.
Design for manufacturing
Design for manufacturing (DfM) is a popular concept that studies the manufacturability of an engineering design (Webb 2008). The DfM field is broad, as manufacturability is dependent on the materials used, the specific manufacturing method employed (e.g. additive, subtractive, etc.), and the particular tools utilized for manufacturing (e.g. which type of 3D printer). We explore GPT-4V’s ability to assist with DfM for two popular manufacturing methods: additive and subtractive.
Design for additive manufacturing (DfAM)
Additive manufacturing (AM) has become increasingly popular as a fabrication method in recent years (Attaran 2017). AM first became popular because of its usefulness in rapid prototyping, but it is also utilized for low quantities of design-varying parts in aerospace and automotive component manufacturing (Attaran 2017). Design constraints for AM vary considerably by the additive system used, and assessing manufacturability of a design often requires some experimentation. However, through manufacturing experience, engineers and machinists often develop manufacturing guidelines or rules for designing a part to be manufactured with a specific process. It would be challenging to quantitatively assess GPT-4V’s ability to predict the 3D-printability of a part, because just as understanding of manufacturability varies from person to person, it would vary from person to model. However, we can provide the model with a set of unambiguous rules pertaining to 3D printability and ask the model to assess the printability of a part based on those rules. Hubs, a ProtoLabs company that offers on-demand manufacturing, created a chart, entitled “Design rules for 3D Printing,” encoding common design rules for AM based on printer type (see Fig. 19) (Hubs 2023). There are ten specific design rules for FDM-printed parts. For example, one rule states that supported walls for an FDM printer can have a minimum thickness of 0.8 mm. These rules are heuristics and exceptions can be found, but the chart enables us to assess GPT-4V’s ability to apply common fabrication rules to a design.
Methodology We assess GPT-4V’s ability to understand and apply AM design rules by asking the model to predict success in 3D-printing various designs. For this task, we created a set of 20 designs, split into two sets: one set of problematic designs and another set of manufacturable designs (see Fig. 17). Based on the ten design rules in the Hubs chart that pertain to FDM manufacturing, we created ten problematic designs, each of which violates one of the ten rules. The other ten designs, comprising the manufacturable design set, are similar to the problematic designs but pass all of the FDM rules in the Hubs chart. To confirm the intended manufacturability of the ten designs in the manufacturable set, we 3D-printed them using a Bambu Lab X1 Carbon printer. All ten designs printed successfully, as shown in Fig. 18.
[See PDF for image]
Fig. 17
(A–J) Ten problematic designs, where each design violates one of the FDM AM rules on the Hubs chart “Design rules for 3D printing.” The specific rule violated is noted below each design. (K–T) Ten manufacturable designs, each based on one of the problematic designs
[See PDF for image]
Fig. 18
Parts from Fig. 17 (K–T) that we 3D-printed using a Bambu Lab X1 Carbon printer
We carried out 20 queries, each in a new context window and each corresponding with one of the 20 designs. For each query, we provided GPT-4V with the chart of the design rules and a dimensioned image—one of the twenty images shown in Fig. 17—of the design we desired to print. We then asked GPT-4V, based on the provided design rules, to predict the success of 3D printing the part using an FDM printer. We asked the model to point to the specific design rule(s) violated if it believed the part would not print successfully. Sample queries can be seen in Fig. 19. To check repeatability, each of these queries was repeated three times for a total of 60 queries. We scored each response as follows:
Manufacturable? (max score 1): If GPT-4V correctly answered if the part was manufacturable or not, we assigned a score of 1, otherwise, 0.
Correct rule (max score 1): This scoring metric is only applicable to designs in the problematic design set. If the rule that was in violation was named in GPT-4V’s response, we assigned a score of 1, otherwise, 0.
# Incorrect rules (max score 0): This scoring metric is only applicable to designs in the problematic design set. This score is the negative of the number of rules GPT-4V claimed were violated but which were not actually violated. For example, if GPT-4V mentioned three rules that the design does not violate, the score for this metric would be -3.
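For a problematic design, the per-trial score reported in Table 9 is simply the sum of these three metrics. A minimal sketch follows; the function and argument names are ours and reflect manual grading of each response.

```python
# Minimal sketch of the per-trial tally used for problematic designs in Table 9.
# Argument names are illustrative; the inputs come from manually grading a response.
def dfam_trial_score(manufacturability_correct: bool,
                     violated_rule_named: bool,
                     num_incorrectly_cited_rules: int) -> int:
    return (int(manufacturability_correct)
            + int(violated_rule_named)
            - num_incorrectly_cited_rules)

# Example: Design B, trial 1 -> 1 + 1 - 1 = 1, matching the Score column in Table 9.
print(dfam_trial_score(True, True, 1))
```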
[See PDF for image]
Fig. 19
Example prompts and answers for the design for additive manufacturing
Table 9. Scores achieved by GPT-4V on the design for additive manufacturing experiments across three trials
Design # | Manufacturable? | | | Correct Rule | | | # Incorrect Rules | | | Score | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
Trial # | 1 | 2 | 3 | 1 | 2 | 3 | 1 | 2 | 3 | 1 | 2 | 3 |
Design A | 1 | 1 | 1 | 0 | 0 | 0 | − 4 | − 3 | − 5 | − 3 | − 2 | − 4 |
Design B | 1 | 1 | 1 | 1 | 0 | 1 | − 1 | − 2 | − 2 | 1 | − 2 | 0 |
Design C | 1 | 1 | 1 | 1 | 1 | 1 | − 2 | − 1 | − 3 | 0 | 1 | − 1 |
Design D | 1 | 1 | 1 | 0 | 0 | 0 | − 3 | − 3 | − 3 | − 2 | − 2 | − 2 |
Design E | 1 | 1 | 1 | 1 | 1 | 1 | − 1 | − 1 | − 2 | 1 | 1 | 0 |
Design F | 1 | 1 | 1 | 1 | 1 | 1 | − 1 | − 1 | − 1 | 1 | 1 | 1 |
Design G | 1 | 1 | 1 | 1 | 1 | 1 | − 2 | − 4 | − 2 | 0 | − 2 | 0 |
Design H | 1 | 1 | 1 | 0 | 0 | 0 | − 2 | − 1 | − 3 | − 1 | 0 | − 2 |
Design I | 1 | 1 | 1 | 0 | 0 | 0 | − 2 | − 3 | − 2 | − 1 | − 2 | − 1 |
Design J | 1 | 1 | 1 | 0 | 0 | 0 | − 1 | − 2 | − 2 | 0 | − 1 | − 1 |
Scores for Designs K-T are not shown in the above table, since GPT-4V always (across all three trials) incorrectly predicted those designs to be not manufacturable, even though they came from the manufacturable set. As such, the Manufacturable? score for Designs K-T for all trials is 0
Design for subtractive manufacturing
Subtractive manufacturing is the most widely used manufacturing technology in industry for producing complex parts. The design process requires careful attention to the manufacturability of a part and is typically iterative. This is particularly challenging for parts with interacting features (Gao and Shah 1998). Unfortunately, there is a very limited number of datasets for this task in the literature. Recently, deep learning-based approaches have been implemented to identify machining features using synthetic CAD datasets (Cao et al. 2020; Zhang et al. 2018). These datasets are created using a curated set of design principles. To this end, we utilize the MFCAD dataset (Cao et al. 2020) to query GPT-4V for manufacturing feature recognition from the image of a CAD model.
Methodology We perform a quantitative study based on multiple queries to GPT-4V. We randomly pick 50 samples from the MFCAD dataset and create images of each CAD model. Each of these images corresponds to a ground truth that assigns machining features to each surface of the CAD model. In total, there are 15 possible machining features that we test in all of the experiments, excluding the stock material block. The list of machining features is the following: rectangular through slot, triangular through slot, rectangular passage, triangular passage, 6 sided passage, rectangular through step, 2 sided through step, slanted through step, rectangular blind step, triangular blind step, rectangular blind slot, rectangular pocket, triangular pocket, 6 sided pocket, chamfer. We query GPT-4V with each of these images and ask for the machining features that are present in the design. First, we provide an initial prompt to focus on design for manufacturing. Next, we query GPT-4V sequentially about each of the images, as shown in Fig. 20. Figure 21 shows two example prompts and the corresponding responses from GPT-4V.
[See PDF for image]
Fig. 20
Machining feature recognition from CAD images: results are shown for eight selected samples where each ground truth (GT) is also shown corresponding to the GPT-4V response
[See PDF for image]
Fig. 21
Example prompts and answers for the feature identification in subtractive manufacturing
The dataset used in this experiment is provided with this document as an open-source small-scale evaluation dataset for vision-language LLMs. This design-for-subtractive-manufacturing dataset is mainly based on the MFCAD dataset (Cao et al. 2020) and consists of 15,488 images of CAD models and their corresponding machining features as labels. We only show results for 8 image-label pairs in this document, as the inaccuracy of GPT-4V makes it difficult to quantify its performance. We anticipate that this dataset can be useful in evaluating a more capable VLM that can understand 3D geometry and engineering design images.
Initial prompt: I am going to ask you a series of questions about some machining feature recognition from an image of a stock of material.
Image prompt: Here are the machining features, from an image you need to identify which machining features are present in the stock of material in the image. – List of machining features: Rectangular through slot, Triangular through slot, Rectangular passage, Triangular passage, 6 sided passage, Rectangular through step, 2 sided through step, Slanted through step, Rectangular blind step, Triangular blind step, Rectangular blind slot, Rectangular pocket, Triangular pocket, 6 sided pocket, Chamfer, Stock
We repeat these 50 queries three times and obtain similar responses. GPT-4V identifies at least one feature in most images but fails to consistently identify features.
Discussion: design for manufacturing Overall, we note that GPT-4V never successfully answers all parts of any of our DfM queries. While the model is able to correctly answer pieces of our questions (e.g. that the design breaks an AM rule, that the design contains a certain machining feature), its answers are never fully accurate. In particular, we notice that GPT-4V sometimes struggles to or forgets to follow directions specified in a prompt, and its performance deteriorates as the complexity of designs increases.
In terms of its ability to predict additive manufacturability based on provided design rules, GPT-4V always states that the provided design will not be able to be successfully produced using AM. The response that the part would not be able to be 3D-printed successfully was consistent across all 60 queries, for both the problematic and manufacturable design sets. In other words, for all the designs that were manufacturable—and which violated no FDM design rules in the chart—GPT-4V hypothesized that they would break one of the 3D printing design rules. This consistently negative response to printability likely reflects a cautious posture on the part of the model. We also observed from the data in Table 9 that GPT-4V always maintains that multiple design rules are broken, while all designs in the problematic design set violate just one of the rules listed in the Hubs design rule chart. As such, GPT-4V is never fully correct in answering any of our questions about additive manufacturability based on the provided design rules. GPT-4V’s listing of many rules in response to our question about rules violated could further reflect its tendency to take on a cautious position. Less than half the time (13/30 queries), GPT-4V is able to correctly identify the violated rule for the problematic designs. We also note that the model sometimes seems to get confused by and/or forgets our request to name the rules by the numbers we assigned to each one in our prompt (see Fig. 19). As seen in Fig. 19, the model lists the first rule by the correct number but then lists the second rule by an incorrect number.
Overall, in the context of additive manufacturing tasks, the implications of using GPT-4V are nuanced. The model consistently predicts that designs will not be successfully produced using additive manufacturing methods, regardless of their actual manufacturability. This uniform negativity indicates a cautious approach, likely to avoid over-optimistic assessments, but also leads to an overestimation of manufacturing challenges. GPT-4V’s tendency to list multiple broken design rules, even when only one is violated, further reflects its caution. However, this approach can be misleading in real-world scenarios where precise and accurate manufacturability assessments are crucial. The model’s difficulty in following specific prompt instructions, such as correctly identifying rules by assigned numbers, points to a need for further development in its ability to process and respond to detailed additive manufacturing queries. While GPT-4V’s partial answering capability suggests a basic understanding of additive manufacturing principles, its current limitations underscore that it is not yet a reliable tool for comprehensive and accurate manufacturability assessments in additive processes.
In subtractive manufacturing tasks, GPT-4V demonstrates an ability to identify at least one machining feature in most images (12/20), but its performance is inconsistent, particularly with more complex designs. For example, GPT-4V often identifies ‘triangular through slot’ instead of ‘rectangular through slot’ and ‘2-sided through step’ instead of ‘6-sided passage’. The model thus confuses distinct features and exhibits challenges in understanding more intricate geometric features. This inconsistency in feature identification can lead to unreliable assessments in scenarios where precision in subtractive manufacturing is essential. While GPT-4V seems to fare better with simpler geometric objects, its difficulty with complex objects suggests that its current use might be more suitable for preliminary assessments or educational purposes, rather than for detailed, technical manufacturing evaluations. The somewhat random nature of its explanations and its inability to satisfy detailed engineering design concerns indicate that significant improvements are necessary before GPT-4V can function as a stand-alone tool in subtractive manufacturing tasks. As such, while GPT-4V can provide some support in these tasks, it requires careful human oversight and verification to ensure accuracy and relevance in practical manufacturing scenarios.
Based on the study’s insights into GPT-4V’s performance in Design for Manufacturing tasks, future work should focus on enhancing the model’s precision and depth of understanding in both additive and subtractive manufacturing processes. For additive manufacturing, research should aim to calibrate GPT-4V’s cautious approach, enabling it to differentiate between manufacturable and non-manufacturable designs more accurately, and follow specific guidelines more precisely. In the realm of subtractive manufacturing, efforts need to be directed toward improving GPT-4V’s ability to consistently and correctly identify complex machining features. This includes training the model to handle a broader range of geometries and intricate design elements, thus reducing its current limitations in assessing detailed and technical aspects of manufacturing designs. Additionally, developing a better way for AI models to understand 3D geometry could enhance GPT-4V’s interpretative capabilities, leading to more reliable and practical applications in the manufacturing sector. These advancements would not only make GPT-4V a more robust tool for manufacturing design but also pave the way for its broader application in automated manufacturing processes. In the next section, we turn to another application of GPT-4V for manufacturability: post-manufacturing inspection.
Post-manufacturing inspection
Engineering inspection constitutes a whole domain within itself: parts must be inspected after they are fabricated to ensure that they meet certain technical requirements, and for critical components, inspection can continue into the lifetime of the part. Inspection is a key aspect of the engineering design process, as it may help in improving the next iteration of the design. Oftentimes, inspection necessitates a visual component (e.g., detection of a defect through an image, X-ray, graph of collected data, etc.) alongside extensive engineering knowledge of detailed documents, like engineering standards. As such, we are curious to understand if GPT-4V, with multimodal capabilities, can aid engineers with defect detection in images.
Methodology For our analysis, we use the CODEBRIM (Concrete DEfect BRidge IMage) dataset, released by Mundt et al. (2019). It contains images of structural concrete from bridges that exhibit none or some of the following defects: cracks, spallations, efflorescence, exposed bars, and corrosion stains. Sample images from the dataset can be seen in Fig. 22. We chose a subset of 23 images from the CODEBRIM dataset for our experiments. The images were chosen such that each of the five defect types was present in at least five images. Five of the 23 images were “background” images, containing no defects. We provided GPT-4V with each image in a separate context window4 and asked the model to identify any of the five defects it could find. If the model was hesitant to respond—due to image resolution or safety concerns—but still suggested certain defects, we counted that as a response. To understand repeatability, each of the 23 image experiments was repeated three times, for a total of 69 queries. Two queries and responses can be seen in Fig. 23. The results for all experiments can be seen in Tables 10, 11, 12, 13, 14.
[See PDF for image]
Fig. 22
Bridge structural concrete images adapted from the CODEBRIM dataset (Mundt et al. 2019) under its specific license. From left to right, as named in the original dataset: (1) image_0000005_crop_0000001.png—contains efflorescence and corrosion stain defects. (2) image_0000046_crop_0000001.png—contains crack defects. (3) image_0000109_crop_0000003.png—contains spallation and corrosion stain defects. (4) image_0001189_crop_0000004.png—contains exposed bar defects
[See PDF for image]
Fig. 23
Example prompts and answers for the concrete defect classification
Discussion The confusion matrices (Tables 10, 11, 12, 13, 14) provided for different defect types in structural concrete offer insights into GPT-4V’s defect detection capabilities. For 12 experiments (8 different images), GPT-4V would not provide an answer to our question, citing resolution issues, safety concerns, or plainly stating that it could not assist with the request. When it did answer, GPT-4V did not perform particularly well in predicting types of defects, as seen by the relatively low F1 scores for each class of defect. We also note that GPT-4V tends to over-predict the crack defect; this is evidenced by the relatively high recall (true positive rate) of 0.79 and the relatively low specificity (true negative rate) of 0.44 when compared with the other classes. Perhaps because it is most familiar with the crack defect, GPT-4V may over-predict cracks out of an abundance of caution, given the safety implications of missing a true positive defect in concrete images of a bridge. For 14 out of the 69 queries, GPT-4V made a perfect prediction of all defect classes in the image. Intriguingly, half of these perfect scores were for images without any defects, suggesting that GPT-4V might be more adept at discerning the absence of defects than at accurately classifying the type of defect present.
Table 10. Confusion matrix for crack defects as predicted by GPT-4V
True positive | True negative | |
|---|---|---|
Predicted positive | 11 | 24 |
Predicted negative | 3 | 19 |
F1 score is 0.45
Table 11. Confusion matrix for spallation defects as predicted by GPT-4V
True positive | True negative | |
|---|---|---|
Predicted positive | 11 | 3 |
Predicted negative | 8 | 35 |
F1 score is 0.67
Table 12. Confusion matrix for efflorescence defects as predicted by GPT-4V
True positive | True negative | |
|---|---|---|
Predicted positive | 3 | 4 |
Predicted negative | 12 | 38 |
F1 score is 0.27
Table 13. Confusion matrix for exposed bar defects as predicted by GPT-4V
True positive | True negative | |
|---|---|---|
Predicted positive | 9 | 5 |
Predicted negative | 6 | 37 |
F1 score is 0.62
Table 14. Confusion matrix for corrosion stain defects as predicted by GPT-4V
True positive | True negative | |
|---|---|---|
Predicted positive | 15 | 10 |
Predicted negative | 6 | 26 |
F1 score is 0.65
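As a worked check of the reported metrics, the snippet below recomputes recall, specificity, and the F1 score for the crack class from the counts in Table 10; this is our own verification sketch, not code from the experiments.

```python
# Worked check of the crack-class metrics using the counts in Table 10:
# TP = 11, FP = 24, FN = 3, TN = 19.
tp, fp, fn, tn = 11, 24, 3, 19

precision = tp / (tp + fp)            # 11 / 35
recall = tp / (tp + fn)               # 11 / 14 ≈ 0.79 (true positive rate)
specificity = tn / (tn + fp)          # 19 / 43 ≈ 0.44 (true negative rate)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, "
      f"specificity={specificity:.2f}, F1={f1:.2f}")   # F1 ≈ 0.45
```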
The findings of this study have significant implications for the application of AI in engineering inspection tasks. While GPT-4V demonstrates potential in identifying defects in structural concrete, its moderate performance underlines the need for further model refinement and continued reliance on human expertise or more specialized machine learning tools. The model’s ability to discern the absence of defects could be leveraged in preliminary inspections to streamline processes, yet the necessity for human verification remains paramount, especially in safety-critical assessments. These results suggest avenues for future research focused on improving AI accuracy through diverse training datasets and approaches that incorporate expert feedback.
Summary
To assess the manufacturing-related knowledge of GPT-4V, we performed three types of experiments. A concise summary of our findings for each is provided below.
Design for additive manufacturing Section 5.1.1—In the realm of additive manufacturing, does GPT-4V consistently predict the 3D printability of a design based on a set of provided DfAM rules?
GPT-4V uniformly (in all instances) indicated that designs would not be suitable for 3D printing. This conclusion was drawn irrespective of whether the designs actually conformed to the specified additive manufacturing rules.
Design for subtractive manufacturing Section 5.1.2—Is GPT-4V capable of identifying manufacturing features in subtractive manufacturing designs?
GPT-4V exhibited a basic grasp of feature geometries but lacked consistency in its responses. The model struggled to differentiate between similar features and frequently resorted to making arbitrary guesses.
Post-manufacturing inspection Section 5.2—To what extent can GPT-4V accurately classify different types of defects in images, specifically in the context of identifying concrete defects in manufacturing?
Based on our experiments with concrete defect classification, we find that GPT-4V may have the potential to distinguish between images that have defects and images that do not. However, it was unable to consistently and accurately classify different types of concrete defects.
Engineering textbook problems
Overview and motivation In this last section, we take a step back from the product development process and investigate GPT-4V’s abilities to solve problems that are present in engineering curricula. During their curriculum, students are regularly asked to solve engineering design problems that require them to interpret sketches, graphs, tables, and images to answer a related question. As such, students need to integrate their natural language processing and visual information understanding skills with their domain knowledge to solve this type of problem. The underlying idea is that these are tasks and assignments used to evaluate humans’ readiness to be engineers. Consequently, they may enable us to draw some comparison with GPT-4V’s readiness to support engineering tasks. Textbook problems, exam questions, and standardized tests have been quite popular ways to evaluate LLMs (Katz et al. 2023; Wang et al. 2023). These problems are often well-defined, self-contained, and mostly closed-form type questions (Taraban 2011), supporting replicability (Zong and Krishnamachari 2023). For example, for text input only, SciBench (Wang et al. 2023) features 695 collegiate-level textbook problems drawn from physics, chemistry, and mathematics. Using this benchmark, SciBench (Wang et al. 2023) aimed at evaluating the reasoning and problem-solving skills of LLMs.
Following a similar approach, we propose to use engineering textbook problems requiring visual information to evaluate GPT-4V’s understanding and problem-solving capabilities through the pairing of different visual and textual information.
Methodology We gathered questions from two undergraduate engineering design classes publicly available under CC-BY-NC-SA on MIT OpenCourseWare (Frey and Gossard 2009; Chun and Kim 2004). The class materials include problem sets and exams. All class materials come with model solutions, which we use as ground truth. To ensure that we are evaluating GPT-4V’s multimodal capabilities, we select questions that reference one or more pictures in the question prompt. We ignore questions that require students to annotate an image as GPT-4V cannot generate images, but questions asking for sketches are included. Indeed, sketches can be parameterized and drawn using coding languages.
To ensure independence, we reset GPT-4V’s context window for each question, except for multi-part questions, where we prompt GPT-4V sequentially within a single context. For multi-part questions that involve multiple images, we supplement each sub-question with only the images required to solve that particular part, to avoid confusing GPT-4V with superfluous information. For example, consider a multi-part question with two images, X and Y, where part (a) requires only X, part (b) requires only Y, and part (c) requires both: we supplement part (a) with X, part (b) with Y, and part (c) with both X and Y.
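The snippet below is a minimal sketch of this image-pairing protocol, assuming an OpenAI-style chat API; the model name, file names, and question text are placeholders rather than the exact prompts used in our experiments.

```python
import base64
from openai import OpenAI

client = OpenAI()

def image_part(path: str) -> dict:
    """Encode a local image as a data-URL content part for a vision-capable chat model."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded}"}}

# Hypothetical multi-part question: part (a) needs image X, part (b) needs Y, part (c) needs both.
parts = [
    ("Part (a): ...", ["X.png"]),
    ("Part (b): ...", ["Y.png"]),
    ("Part (c): ...", ["X.png", "Y.png"]),
]

messages = []  # one running context for the whole multi-part question
for text, images in parts:
    content = [{"type": "text", "text": text}] + [image_part(p) for p in images]
    messages.append({"role": "user", "content": content})
    reply = client.chat.completions.create(model="gpt-4-vision-preview", messages=messages)
    messages.append({"role": "assistant", "content": reply.choices[0].message.content})
```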
We evaluate GPT-4V’s correctness on a binary scale: one point is given only for fully correct answers, and zero otherwise. Being “fully correct” means outputting an answer that is semantically similar to the ground truth for free-text questions. For questions involving calculations, the correct numerical answer must be provided and the intermediate steps should reasonably lead to the correct solution. For multi-part questions, we award a point for each correct part. We group the errors into three categories:
Reasoning: Incorrect explanation or calculation.
Inference: Incorrect information extraction from the image.
Imprecise: Vague answer or explanation without execution.
Can GPT-4V Solve Engineering Textbook Problems? We extracted 21 questions from two classes, resulting in a total of 44 questions when counting each sub-part individually. We observe that GPT-4V answers 16 of these 44 questions correctly, for an average accuracy of 36%. An overview of all the questions and repeats is provided in Table 17. With respect to image type, GPT-4V correctly answered most questions involving 3D models and tables (63% and 67%, respectively) but had a lower success rate for photographs (33%), diagrams (29%), and graphs (0%), see Table 15. In terms of question format, GPT-4V performed slightly better on free-text questions (44%) than on any other format, see Table 16. Overall, we observe that GPT-4V makes mostly reasoning errors (20), followed by imprecise answers (5) and inference errors (3). Thus, it appears most helpful for questions that require explanations and for problems about tables or 3D models.
Table 15. Summary of GPT-4V’s score on textbook problems grouped by type of image
| | Photo | Diagram | Graph | 3D | Table | Overall |
|---|---|---|---|---|---|---|
Correct | 1 | 8 | 0 | 5 | 2 | 16 |
Total | 3 | 28 | 2 | 8 | 3 | 44 |
Avg. (%) | 33 | 29 | 0 | 63 | 67 | 36 |
Table 16. Summary of GPT-4V’s score on textbook problems grouped by question format
| | Free text | MCQ | Numerical | Draw | Overall |
|---|---|---|---|---|---|
Correct | 7 | 0 | 8 | 1 | 16 |
Total | 16 | 1 | 24 | 3 | 44 |
Accuracy (%) | 44 | 0 | 33 | 33 | 36 |
Selected questions and answers In the following, we reproduce selected questions and answers to illustrate the type of questions, as well as the type of errors in the answers.
First, we look at an example of an imprecise answer. Figure 24 shows the question and answer to Q1 (a–c). Although GPT-4V can describe relationships between stall torque, no-load speed, and maximum power, it fails to provide the exact proportions by which these quantities increase or decrease. For Q1, the expected solution is that by doubling the number of windings, the stall torque doubles, the no-load speed is cut in half, and the maximum power stays constant. Notably, the provided answer also contains additional explanations that were not asked for.
Fig. 24: A diagram-based textbook problem about motor parameters (Q1, Repeat 1). [Image not reproduced]
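For reference, the expected solution follows from the standard brushed DC-motor relations; a brief sketch, under the common textbook assumption that the supply voltage V and winding resistance R are held fixed while the torque constant k scales with the number of windings:

```latex
\tau_{\text{stall}} = \frac{kV}{R}, \qquad
\omega_{\text{no-load}} = \frac{V}{k}, \qquad
P_{\max} = \tfrac{1}{4}\,\tau_{\text{stall}}\,\omega_{\text{no-load}} = \frac{V^2}{4R}
```

Doubling the windings doubles k, so the stall torque doubles, the no-load speed halves, and the maximum power is unchanged, in line with the model solution.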
Next, we look at Q2, which is the only multiple-choice question, see Fig. 25. The correct answer to this question is (d), since the capacitor is already charged up to the supply voltage and cannot discharge through the LED. GPT-4V seems unable to understand the circuit based on the photograph and thus bases its responses solely on the provided text. While it understood that the LED and the capacitor were in series, it got some basic physics concepts wrong.
Fig. 25: Reasoning on an electric circuit based on a photograph (Q2, Repeat 1). [Image not reproduced]
Finally, we look at a question that requires extracting values from a table and performing calculations with them, see Fig. 26. While arguments can be made for entering tabular data as text, it is often more convenient for users to provide tables as images. In this particular answer (repeat 3), GPT-4V is able to correctly extract the values and calculate the center distance, the torque, and the reaction force. It is worth keeping in mind (see Table 17) that the other two repeats were not as successful, showing how challenging such a task is.
Table 17. Detailed list of questions including image type and question format and answers by GPT-4V for each trial, along with the type of error
| | Image Type | Format | #1 / Error | #2 / Error | #3 / Error | Overall |
|---|---|---|---|---|---|---|
Q1-a | Diagram | Free text | Imprecise | Imprecise | Imprecise | |
Q1-b | Diagram | Free text | Imprecise | Imprecise | Imprecise | |
Q1-c | Diagram | Free text | Imprecise | Imprecise | Imprecise | |
Q2 | Photograph | MCQ | Reasoning | Reasoning | Reasoning | |
Q3-a | 3D-model | Free text | ||||
Q3-b | 3D-model | Free text | ||||
Q4-a | Diagram | Numerical | Reasoning | Reasoning | ||
Q4-b | Diagram | Numerical | Inference | Inference | Inference | |
Q5 | Diagram | Draw | Imprecise | Imprecise | Imprecise | |
Q6 | Photograph | Numerical | ||||
Q7 | Photograph | Numerical | Reasoning | Reasoning | Reasoning | |
Q8-a | 3D-model | Free text | Reasoning | Reasoning | ||
Q8-b | 3D-model | Free text | Reasoning | Reasoning | ||
Q8-c | 3D-model | Draw | Imprecise | Imprecise | Imprecise | |
Q9-a | Table | Numerical | Inference | |||
Q9-b | Table | Numerical | Reasoning | |||
Q9-c | Table | Numerical | Reasoning | Reasoning | ||
Q10 | 3D-Model | Free text | ||||
Q11 | Diagram | Numerical | Reasoning | Reasoning | Reasoning | |
Q12-a | Diagram | Numerical | Reasoning | Reasoning | Reasoning | |
Q12-b | Diagram | Numerical | Reasoning | Reasoning | Reasoning | |
Q12-c | Diagram | Numerical | Reasoning | Reasoning | Reasoning | |
Q12-d | Diagram | Free text | ||||
Q13 | Diagram | Numerical | Reasoning | Reasoning | Reasoning | |
Q14-a | Diagram | Numerical | Reasoning | Reasoning | Reasoning | |
Q14-b | Diagram | Free text | Reasoning | Reasoning | Reasoning | |
Q14-c | Diagram | Free text | ||||
Q15-a | Diagram | Numerical | Reasoning | Reasoning | Reasoning | |
Q15-b | Diagram | Numerical | Reasoning | Reasoning | Reasoning | |
Q16-a | Diagram | Draw | ||||
Q16-b | Diagram | Free text | ||||
Q16-c | Graph | Free text | Inference | Inference | ||
Q16-d | Diagram | Numerical | Reasoning | |||
Q16-e | Diagram | Numerical | Reasoning | Reasoning | Reasoning | |
Q17-a | Diagram | Numerical | ||||
Q17-b | Diagram | Numerical | Reasoning | Reasoning | Reasoning | |
Q17-c | Diagram | Numerical | ||||
Q18-a | Diagram | Numerical | Reasoning | Reasoning | ||
Q18-b | Diagram | Numerical | Reasoning | Reasoning | Reasoning | |
Q19-a | Diagram | Free text | Reasoning | Reasoning | Reasoning | |
Q20-a | Diagram | Free text | ||||
Q20-b | Diagram | Free text | Inference | Inference | Inference | |
Q21-a | 3D-Model | Numerical | ||||
Q21-b | 3D-Model | Numerical | ||||
Overall score | 16 (36%) | |||||
Fig. 26: Calculating answers using information from a table (Q9, Repeat 3). [Image not reproduced]
Discussion: Textbook Problems
As previously mentioned, GPT-4V makes three types of mistakes: reasoning, image misinterpretation, and imprecision. We go into detail below.
Reasoning GPT-4V sometimes provides incorrect answers as a result of reasoning errors. This was especially apparent with multi-step reasoning tasks. When asked what happens to a system after a series of actions is performed, such as in the question shown in Fig. 25, GPT-4V often makes a mistake in its reasoning, such as hallucinating a fact, and derails its chain of thought. Furthermore, when asked to compute a numerical answer, GPT-4V can follow the correct methodology yet still give an incorrect result due to an imprecise numerical approximation: in one case, it replaced an exact value with a coarse approximation and arrived at an incorrect answer even though its previous steps were correct. This issue is not difficult to alleviate, however, as previous work has shown that leveraging tools such as calculators can enable GPT-4V to perform numerical reasoning tasks better.
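As an illustration of this kind of tool use, the following is a minimal sketch (with hypothetical helper names, not the setup used in our experiments) in which the model is asked to emit a machine-readable expression that is then evaluated in Python instead of being approximated in prose:

```python
import math

def evaluate_expression(expr: str) -> float:
    """Evaluate a model-emitted arithmetic expression with a restricted namespace."""
    allowed = {name: getattr(math, name) for name in ("sqrt", "sin", "cos", "tan", "pi", "exp", "log")}
    return eval(expr, {"__builtins__": {}}, allowed)  # demo only; not hardened against hostile input

# Instead of letting the VLM approximate the value in its answer text, prompt it to
# output a machine-readable expression and compute the value exactly here.
model_output = "sqrt(3) / 2"              # hypothetical expression returned by the model
print(evaluate_expression(model_output))  # 0.8660..., avoiding coarse rounding errors
```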
Image misinterpretation GPT-4V can have trouble understanding and inferring information in images. In Fig. 25, GPT-4V fails to interpret the circuit shown as a photograph. As a result of this misinterpretation, GPT-4V cannot leverage the information present in it and only answers based on the provided text.
Imprecision While GPT-4V has reasonable success in providing qualitative answers, it sometimes fails because its answers are too vague or do not capture the main idea of a question. In a question about potential errors in an injection molding experiment, GPT-4V provided a long list of potential issues, but not issues that were specific to the question. GPT-4V also had trouble stating the exact numerical relationships between different variables, as seen in Fig. 24.
Summary
Here we provide a summary of our findings regarding GPT-4V’s ability to solve engineering textbook problems.
High level reasoning Is GPT-4V able to reason about domain-specific knowledge at a high level?
We evaluated GPT-4V on 16 free-text questions, which probe high-level reasoning about mechanical engineering topics. Among the different question types, GPT-4V performs relatively better on these free-text questions than on the other tasks. Free-text questions tend to be more open-ended and require less precise reasoning than other types of questions. As a result, GPT-4V can provide answers that are considered correct even though they may not fully correspond to the model solution.
Numerical reasoning How strong is GPT-4V at numerical reasoning tasks?
We evaluated GPT-4V on 24 numerical reasoning tasks, of which it answered 8 correctly. We note that GPT-4V’s failures on numerical questions come from two sources. First, it is unable to compute answers precisely, causing subsequent steps in the computation to deviate from the correct answer. Second, it fails to logically incorporate domain knowledge during its reasoning process, causing it to use incorrect formulas.
Failure modes When does GPT-4V fail to provide satisfactory answers?
We classified the failure modes of GPT-4V into three cases: reasoning, inference, and imprecision. Reasoning errors happen when the model fails at complex, precise logical steps or does not correctly incorporate domain knowledge; this was especially apparent for multi-step reasoning tasks. Inference errors happen when GPT-4V fails to incorporate image information into its answers. Imprecision errors happen when GPT-4V’s answers are too vague or miss the main point of the question, as described in the section above.
Spatial reasoning abilities
Mental rotation and packing tests
Spatial reasoning is the ability of humans to perform mental spatial operations: rotation, translation, projection, and orientation. Spatial reasoning is at play when humans read maps, navigate their homes at night without light, or solve virtually any problem in the fields of science, technology, engineering, and mathematics (STEM) (Maeda and Yoon 2013). Spatial reasoning skills are considered essential for understanding graphs, diagrams, plots, 3D objects, and representations. Indeed, multiple studies have found that spatial abilities are a good predictor of academic success (Shea et al. 2001; Berkowitz and Stern 2018). Consequently, spatial reasoning skills have been well studied in humans, and many standardized tests exist, e.g., the Revised Purdue Spatial Visualization Test: Visualization of Rotations (PSVT:R) (Yoon 2011), the Mental Cutting Test “Schnitte” (Quaiser-Pohl 2003), or the Novel Spatial Ability Tests (Berkowitz et al. 2021).
Following some of our observations on the apparent struggles of GPT-4V regarding spatial understanding, we specifically tested its spatial abilities in order to provide additional insights. Spatial reasoning tests are also good candidates to evaluate vision language models since they focus on inherently visual tasks and are often not publicly available, to maintain their validity. They are thus unlikely to be part of the training data.
Methodology We assessed GPT-4V’s spatial reasoning skills using the openly accessible packing test (part of the Novel Spatial Ability Tests (Berkowitz et al. 2021)), and the MechE Rotation Test (Picard 2023). While the first one is openly accessible, the latter is released publicly for the first time in parallel to this work.
The MechE Rotation Test follows the general principles of the PSVT:R, but uses objects with features typically seen on mechanical parts. It measures the ability of participants to visualize one or two rotations applied to a reference object and apply them to another object. For each question, five possible configurations of the object are shown and the participants select the correct one. The test is composed of an example—for which the correct answer is given to the participant—followed by ten questions of increasing difficulty. The packing test requires participants to evaluate if shapes can be composed of or decomposed into smaller sub-shapes. The packing test is split into two parts: packing and unpacking. In the first part, participants have to choose among four options which set of sub-shapes can be packed together to form a larger shape. In the second part, participants do the opposite and select among four large shapes, which can be decomposed into the provided smaller shapes. The example questions for these tests are shown in Fig. 27.
Fig. 27: Example questions from the considered spatial reasoning tests. [Image not reproduced]
In this work, we have GPT-4V take the tests as they are given to human participants: we type the exact textual instructions from these standardized tests and include the corresponding images. Each questionnaire is passed in a single context, sequentially going through the examples and the questions while providing the instructions and images. To account for stochasticity, each questionnaire is repeated five times. In addition, and inspired by Yang et al. (2023a), we evaluate whether adding visual marks (reference coordinates and colored faces) improves the performance of the model on the MechE rotation test. The runs using the original test are referred to as Run H, while the ones with the marks are called Run P.
The full set of prompts is made available as a benchmark for any future vision language model (Fig. 28).
Fig. 28: Example of the MechE rotation test with the additional visual prompts to support the model. [Image not reproduced]
Scores on the spatial reasoning tests The answers of GPT-4V for the packing test and the MechE rotation test are provided in Tables 18 and 19, respectively. Starting with the packing test, GPT-4V obtains an average score across five runs of 36%, slightly higher than the score expected when answering at random (25%). Interestingly, each of the five questions that was ever answered correctly was answered correctly by at least two runs, further suggesting that GPT-4V is not answering at random. Compared to humans, however, this remains significantly lower than the average scores of undergraduate (66%) and graduate (73%) students reported by Berkowitz et al. (2021). For the MechE rotation test, the average scores (16% and 20%) are lower and closer to the score expected for random answering (20%). While the score with visual marks is slightly higher, it is unclear whether the visual prompting actually supports GPT-4V. While no human results have been published for this test, average scores between 60% and 70% are expected based on internal tests and by comparison to the revised PSVT:R test.
Table 18. Answers and scores for the Packing Test (Berkowitz et al. 2021)
(Correct) | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|---|---|---|---|---|---|
Part 1 Q1 (4) | 3 | 2 | 3 | 3 | 3 |
Part 1 Q2 (3) | 4 | 1 | 4 | 4 | 4 |
Part 1 Q3 (3) | 3 | 3 | 3 | 3 | 3 |
Part 1 Q4 (2) | 2 | 1 | 2 | 2 | 2 |
Part 1 Q5 (1) | 3 | 4 | 4 | 4 | 3 |
Part 2 Q1 (3) | 3 | 3 | 2 | 3 | 3 |
Part 2 Q2 (3) | 4 | 1 | 1 | 4 | 1 |
Part 2 Q3 (2) | 2 | 4 | 3 | 2 | 2 |
Part 2 Q4 (2) | 1 | 3 | 1 | 3 | 1 |
Part 2 Q5 (3) | 3 | 1 | 4 | 4 | 3 |
Scores | 50% | 20% | 20% | 40% | 50% |
Average | 36% |
The correct answer for each question is given in parentheses in the first column. Each run was conducted within the same context
Table 19. Answers and scores to the MechE Rotation Test (Picard 2023)
(Correct) | H1 | H2 | H3 | H4 | H5 | P1 | P2 | P3 | P4 | P5 |
|---|---|---|---|---|---|---|---|---|---|---|
Q1 (D) | C | C | C | D | D | D | D | C | C | B |
Q2 (A) | D | B | A | D | C | D | B | D | B | D |
Q3 (C) | C | C | C | E | B | C | D | B | C | C |
Q4 (E) | C | D | A | A | C | C | C | D | B | D |
Q5 (C) | C | B | A | B | B | D | B | B | C | C |
Q6 (B) | E | D | C | D | C | D | E | E | D | E |
Q7 (C) | E | E | D | C | D | E | C | C | B | D |
Q8 (A) | B | C | C | B | E | C | D | B | C | C |
Q9 (E) | D | B | A | C | B | C | A | C | B | E |
Q10 (A) | C | C | C | B | D | B | E | B | C | B |
Scores | 20% | 10% | 20% | 20% | 10% | 20% | 20% | 10% | 20% | 30% |
Average | 16% | 20% |
The correct answer for each question is given in parentheses in the first column. Each run was conducted within the same context
To gain more insight, run P1 is reproduced in Fig. 29. GPT-4V’s answers seem to indicate that the nature of the test and the task is well understood. However, while the answer is correct, the reasoning is incorrect: based on the provided coordinate system, the reference object is rotated about the X-axis, not the Z-axis as stated by GPT-4V. This type of behavior has already been reported for numerical reasoning (Stechly et al. 2023). As such, it seems that additional visual or textual instructions are needed to properly root the model within a spatial reference system.
Fig. 29: First two prompts of run P1 of the MechE rotation test, with the corresponding answers by GPT-4V. Both prompts are executed consecutively in the same context. [Image not reproduced]
Discussion Overall, our evaluation of the spatial abilities of GPT-4V using standardized (human) tests suggests that, compared to humans, GPT-4V has some, although limited, spatial reasoning capabilities. Indeed, while these visualization tasks are hard and constructed to be somewhat deceptive, most untrained undergraduate students in science and technical fields answer at least half of the questions correctly (Yoon 2011; Berkowitz et al. 2021). This apparent limitation could, in part, explain GPT-4V’s difficulties with engineering design tasks such as CAD generation, see Sect. 4.2. These results also corroborate findings recently reported in the literature (Wen et al. 2023).
Summary
Our evaluation of GPT-4V’s performance on standardized spatial reasoning tests is summarized below.
Standardized spatial reasoning tests Can GPT-4V correctly answer multiple-choice questionnaires created to assess spatial reasoning skills in humans?
We repeated the MechE rotation test and the packing test five times each, for a total of 100 multiple-choice questions. The model’s success rates are similar to those expected from random guessing given the number of answer options in these questions.
Further, we investigated a visual prompting approach based on Yang et al. (2023a) on the MechE rotation test, aiming to assist the model. Repeating this five times, we see no improvement in model performance over the unmodified test.
Benchmarking LLaVA
We further explored how well VLMs perform engineering design tasks by performing the quantitative experiments with LLaVA 1.6 34B. It is an open-source VLM with both a chatbot and API interface, and for our experiments, we utilized the API interface with a “temperature” and “top-k” of 1.0. The results of the quantitative experiments using both GPT-4V and LLaVA 1.6 34B are shown in Table 20.
Across all 12 relevant quantitative experiments, GPT-4V outperforms LLaVA 1.6 34B. It is important to note that LLaVA 1.6 34B can only take one input image per context; therefore, tasks that required two or more input images could not be run with this model. We denote this in Table 20 where applicable, such as for the spatial reasoning experiments. One quantitative experiment that we ran with only GPT-4V was the design similarity triplet experiment, because LLaVA 1.6 34B often output nonsensical answers. For example, when presented with three designs labeled A, B, and C, and asked which design is most similar to A, LLaVA 1.6 34B often answered “Design D,” or gave other irrelevant answers such as “The image in the middle.” Therefore, we could not use the triplet similarity results to measure self-consistency or transitive violations. We note, however, that GPT-4V did not have this problem.
Overall, this section demonstrates how the datasets and experiments published as part of this work can be used to benchmark future VLMs on engineering design tasks. Furthermore, these results reinforce the need for field-specific, relevant datasets to evaluate the performance of machine-learning models. Indeed, our results strongly contrast with the performance reported by LLaVA 1.6 34B’s authors, who show the model outperforming GPT-4V and Gemini Pro on several general-purpose benchmarks (Liu et al. 2024).
Table 20. Total score for both GPT-4V and LLaVA 1.6 34B on the quantitative experiments in this work
| | Max | GPT-4V | LLaVA 1.6 34B |
|---|---|---|---|
Design description | |||
With text description | 30 | 30 | 26 |
No text description | 30 | 16 | 14 |
No text descr., no N/A | 30 | 21 | 13 |
Engg. drawing analysis | 99 | 86 | 29 |
CAD generation (1st try) | 54 | 28 | 3 |
Topology optimization | 90 | 68 | 43 |
Design for manufacturing | |||
Additive | 90 | − 22 | N/A |
Machining features | 60 | 0 | 0 |
Crack inspections | 345 | 204 | 172 |
Textbook questions | 135 | 51 | 26^a
Spatial reasoning | |||
Rotation | 100 | 18 | N/A |
Packing | 50 | 18 | N/A |
Total | 1113 | 540 | 326 |
The maximum possible score is provided as the “Max.” column
^a 15 of the 135 textbook questions involved two input images, which LLaVA 1.6 34B does not support
Discussion
In this paper, we aimed to assess the capabilities of VLMs in several engineering design tasks ranging from conceptual design to manufacturing, and develop a reusable benchmark to evaluate future VLMs. Below, we present a detailed discussion of our findings from each section.
Conceptual design
We examined design similarity analysis, sketch descriptions, and concept selection. We discovered that GPT-4V could evaluate design similarity with high self-consistency and minimal transitive violations. It was also consistent with human-generated idea maps in identifying unique sketches and groups of similar sketches. Additionally, it effectively matched design sketches to their descriptions when provided with the entire sketch, including a handwritten description (an average score of 10/10), but without the description it often chose “None of the above” and therefore performed worse (average score of 5.33/10). When “None of the above” was not an option, GPT-4V performed better (an average score of 7/10). This suggests a level of “caution”: when GPT-4V has the option to avoid being incorrect, it takes it. GPT-4V could generate useful and accurate text descriptions of designs, even for sketches with very low drawing scores. Lastly, the model generated appropriate selection criteria but did not generate Pugh charts when only provided with design sketches. Overall, GPT-4V shows the great potential of VLMs for design sketch analysis and for supporting the conceptual design stage beyond what has been identified in previous work (Siddharth et al. 2022; Stella et al. 2023).
System-level and detailed design
We investigated GPT-4V’s ability to use several Ashby diagrams to suggest appropriate materials, analyze engineering drawings, and generate CAD scripts. We found that GPT-4V could correctly indicate where to look for materials in Ashby diagrams, but made errors when asked to be more specific. The model had difficulty understanding the nuances of a block-with-blind-hole engineering drawing, but it was able to extract most dimensions and assign them appropriate labels. In terms of CAD generation, GPT-4V produced a correct CAD script on the first attempt in only one out of nine experiments, and our iterations to fix the scripts did not improve the results. LLaVA 1.6 34B scored lower on the engineering drawing analysis and was unable to produce a correct CAD script.
We also investigated GPT-4V’s ability to understand and analyze structures resulting from topology optimization (TO). GPT-4V showed an understanding of high-level TO principles such as volume fraction and proposed realistic locations for boundary conditions. However, when provided with an image, it struggled to estimate the volume fraction of a design or to identify the presence of floating material (Table 8). When allowed to use a code interpreter, however, GPT-4V was able to estimate the volume fraction much more effectively. This suggests that for certain tasks, specifically those requiring spatial reasoning, engineers may benefit from integrating VLMs with other tools or plug-ins.
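To illustrate why a code interpreter helps here, estimating the volume fraction of a 2D topology-optimized design from its image essentially reduces to counting material pixels. A minimal sketch, assuming a grayscale image in which dark pixels denote solid material (the file name and threshold are illustrative):

```python
import numpy as np
from PIL import Image

def estimate_volume_fraction(path: str, threshold: int = 128) -> float:
    """Fraction of pixels classified as material (darker than the threshold)."""
    gray = np.asarray(Image.open(path).convert("L"))
    material = gray < threshold  # dark pixels are treated as solid material
    return float(material.mean())

print(f"Estimated volume fraction: {estimate_volume_fraction('topology.png'):.2f}")
```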
Manufacturing and inspection
In the manufacturing stage, we tested GPT-4V’s understanding of design for manufacturing (DfM) for subtractive and additive manufacturing operations. GPT-4V was, as we interpret it, cautious, and suggested that none of the additive manufacturing parts were printable, even when the parts were well within the provided guidelines. On the feature identification task for subtractive manufacturing, GPT-4V identified at least one feature in 12 out of 20 trials, but never all of them. The provided explanations were inconsistent most of the time and confused different technical terms. Furthermore, we assessed GPT-4V’s ability to inspect images to find and identify defects. For the evaluated cases, GPT-4V tended to over-predict the presence of defects and was inconsistent in identifying the type of defect. Even more than in the detailed design stage, design for manufacturing and inspection is all about precision, and GPT-4V overall fails to deliver reliable and consistent performance on the evaluated tasks.
Engineering textbook problems
We evaluated GPT-4V’s ability on textbook problems in engineering. Overall, GPT-4V achieves rather low scores with a 36% accuracy. It performed the best for textbook problems asking for explanations (free text answers) but struggled for numerical questions, both in the reasoning and in the numerical value extraction from the provided images.
Spatial reasoning
Lastly, we evaluated GPT-4V’s spatial reasoning abilities through tests typically used to evaluate humans. Overall, GPT-4V achieves rather low scores, with 36% and 18% accuracy on the packing test and the MechE rotation test, respectively. GPT-4V’s scores are close to those of random answer picking, and the provided explanations did not match the visual representations. Given these low scores, the spatial reasoning tests could become a competitive benchmark to evaluate future multimodal LLMs. However, the current protocol for these tests, which follows the testing protocol for humans, requires models to accept multiple images, a capability most models do not currently have.
Techniques for enhancing VLM performance
There are a number of techniques that can be applied during prompt construction or model inference, including in-context learning (ICL), retrieval-augmented generation (RAG), and the integration of other computational tools. For example, in-context learning is a prompting technique in which the model is given a few input–output demonstrations of a task within the prompt itself, enabling it to generalize to new examples without additional fine-tuning (Brown et al. 2020). This approach is especially useful in data-scarce scenarios, as it requires only a handful of examples to adapt the model to a new downstream task (Luo et al. 2024; Dong et al. 2024). In the engineering design domain, Edwards et al. (2025) explored the efficacy of ICL in enabling VLMs to match expert design evaluations. Their results indicate that incorporating ICL improves the model’s ability to match an expert, even outperforming a trained novice. However, the type of context matters: textual and image-based prompts can yield different results.
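To make the mechanism concrete, the following is a minimal sketch of an in-context-learning prompt, in which a few input–output demonstrations are prepended to a new query; the task, example texts, and rating scale are hypothetical and not the prompts used in the cited work.

```python
# Few-shot (in-context learning) prompt: demonstrations are supplied in the prompt
# itself, so the model adapts to the task without any fine-tuning.
demonstrations = [
    ("Sketch shows a gripper with two parallel actuated jaws.", "Rating: 4/5"),
    ("Sketch shows a single rigid hook with no actuation.", "Rating: 2/5"),
]
query = "Sketch shows a four-bar linkage gripper with a compliant pad."

prompt_lines = ["Rate each design concept for functional feasibility."]
for description, rating in demonstrations:
    prompt_lines += [f"Design: {description}", f"Answer: {rating}"]
prompt_lines += [f"Design: {query}", "Answer:"]

prompt = "\n".join(prompt_lines)
print(prompt)  # send this as the user message, optionally alongside the sketch images
```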
Another approach to improve LLM or VLM performance on domain-specific tasks is to incorporate retrieval-augmented generation (RAG) into the inference pipeline (Lewis et al. 2020). RAG enables models to access information that they may have had limited or no exposure to during pretraining. RAG works by chunking documents into small sections, which are then embedded. During inference, the user’s query is also embedded, and cosine similarity is computed between the query and each document chunk to identify the N most relevant chunks. These are appended to the user’s query and fed into the model as input for response generation.
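A minimal sketch of this retrieval step is shown below, assuming a generic embed() placeholder standing in for any sentence-embedding model; it illustrates the mechanism only and is not the pipeline used in the cited works.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; in practice, call a sentence-embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

def top_k_chunks(query: str, chunks: list, k: int = 3) -> list:
    """Rank document chunks by cosine similarity to the query embedding."""
    q = embed(query)
    scores = [float(q @ embed(c)) for c in chunks]  # cosine similarity of unit vectors
    order = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in order]

chunks = ["Chapter 3: shaft design ...", "Chapter 7: gear geometry ...", "Appendix B: material tables ..."]
question = "What module should the pinion have?"
context = "\n\n".join(top_k_chunks(question, chunks))
prompt = f"Use the following excerpts to answer.\n\n{context}\n\nQuestion: {question}"
```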
RAG has demonstrated potential to improve model performance on domain-specific questions. For example, Xiong et al. (2024) show that coupling GPT-3.5 and Mixtral with a RAG framework improves their performance by up to 18% on medical question answering. Alawwad et al. (2024) combine supervised fine-tuning and RAG to enhance LLM performance on textbook questions. Applying a RAG system to the questions in this study would likely improve model responses. However, prior work highlights limitations in RAG’s practical utility. For instance, Doris et al. (2025) used an off-the-shelf RAG system to provide VLMs with domain-specific technical content but found that it often failed to retrieve the most relevant sections. Even when provided with curated, relevant context (essentially an “idealized” RAG setup), LLaVA-1.5 still struggled to answer questions due to inherent VLM limitations, such as inaccurate image analysis and limited engineering knowledge. Moreover, while some questions in our study lend themselves well to RAG (e.g., textbook problems), others (e.g., engineering drawing analysis) lack obvious reference documents. Future work might explore RAG systems targeted at domain-specific question answering, such as that of Siriwardhana et al. (2023), to improve performance on these tasks.
While this paper does not explore each prompting technique in depth, our goal was to use minimal, human-readable prompts to assess VLM performance and to provide a benchmark that enables future comparisons. Future work could iteratively implement these techniques to identify the most effective strategies. To support such efforts, we have released all original prompts and VLM outputs via the Harvard Dataverse (Picard et al. 2024).
AI agents for engineering design
Work on AI agents for engineering design is still emerging. For example, one study mimics the traditional engineering design process by framing it as a multi-agent system (Ocker et al. 2025). Another system uses multiple agents to support car design workflows, assigning different responsibilities, such as styling or CFD simulation, to individual agents (Elrefaie et al. 2025). Yet another approach explores tool-augmented agents, enabling an LLM to use engineering tools via a Python API to solve problems such as CAD question answering, sketch constraining, and parameterization (Mallis et al. 2025).
While AI agents hold promise for assisting engineers, many commonly used tools are GUI-based and cannot be invoked via an API. GUI navigation, even for relatively simple tasks like performing a Google search, remains a significant challenge, and current agent frameworks are unlikely to scale to the complexity of professional engineering software. Unlike other UI navigation tasks, engineering workflows often require strong spatial reasoning to interpret graphs, 3D models, and other domain-specific visualizations. As a result, there remains a significant gap before AI agents can robustly support complex engineering design tasks.
Looking forward: vision-language models in engineering design
Future VLMs that are meant to assist with engineering tasks must understand dimensions and scales and be capable of spatial reasoning. The visuals in engineering tasks often provide critical spatial information. However, current VLMs show weak performance when interpreting images for precise spatial information. For instance, in Sect. 7, which explores spatial reasoning tasks, GPT-4V performed with about the accuracy of random guessing. Furthermore, in Sect. 4.2, GPT-4V succeeded in recognizing from an engineering drawing that a design had a blind hole rather than a through hole only 11% of the time. As shown in Sect. 4.3, both models in this study performed poorly when estimating the volume fraction of an image. Understanding relative dimensions, scales, and orientations within a visual will greatly increase a VLM’s utility for engineering tasks and should be an area of future development.
To best serve engineering tasks, future models should be designed to process multiple images and text simultaneously. Many engineering tasks require viewing visuals in context with each other. This allows engineers to cross-reference charts, compare designs, etc. The most useful VLMs will also have this functionality so that they can more seamlessly integrate into an existing engineering workflow.
A useful model must understand and follow instructions. Especially if an engineer is using a VLM for the benefit of automation at scale, this benefit is undermined if the VLM does not properly follow instructions. For example, if instructed to select from options A, B, and C, the model should not generate a fourth option D. This problem arose in Sect. 3 both in matching design descriptions and in answering the triplet similarity questions.
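One practical mitigation, sketched below with a stubbed model call (the prompt and stub are hypothetical), is to validate that the returned option is in the allowed set and to re-ask a bounded number of times otherwise:

```python
import re

ALLOWED = {"A", "B", "C"}

def ask_model(prompt):
    """Stub standing in for a VLM call; replace with an actual API request."""
    return "Design D is the most similar."  # hypothetical misbehaving answer

def parse_choice(answer):
    """Return the first standalone capital letter if it is an allowed option, else None."""
    match = re.search(r"\b([A-Z])\b", answer)
    letter = match.group(1) if match else None
    return letter if letter in ALLOWED else None

choice = None
for _ in range(3):  # re-ask a bounded number of times if the answer is invalid
    choice = parse_choice(ask_model("Which design is most similar to A? Answer with A, B, or C."))
    if choice is not None:
        break
```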
Engineers should be aware that current models may require users to iteratively adjust their prompts and images to obtain desired results. For certain tasks, this iteration and correction of the VLM may render its benefits moot. For other tasks, iterative feedback could be useful for exploring different solution strategies, or for ensuring that the user has mastery of a subject and can therefore correct a VLM. The need for iteration is demonstrated in Makatura et al. (2024), which shows that LLMs have great potential to generate valid CAD scripts; however, these results require users to iteratively guide the model to a desired output.
If a VLM is unsure of an answer, it has two primary options for its response: indicate that it is unsure and perhaps request more information, or make up a response that sounds correct. For engineers, we propose that a more useful VLM will indicate that it is unsure and ask for additional information. This occurred in Sect. 3 when trying to make a Pugh chart for concept selection; in this case, the model requested more information about each design. In one trial, GPT-4V stated “Since I cannot physically interact with the concepts to evaluate them against the criteria you’ve provided, such as “Ease of handling” or “Durability,” I am unable to make informed decisions about the scores. These criteria often require subjective assessments or empirical testing that cannot be performed without direct interaction with the designs. However, I can guide you on how you might consider each criterion for scoring:...” This indicates that practitioners should be prepared to engage in an iterative information exchange with AI tools.
Limitations of this study
Evaluating the performance of GPT-4V faces the same challenges and concerns raised in previous studies of LLMs (Mao et al. 2023). Below, we highlight a few of the limitations.
Specificity of Engineering Problems: While we attempted to cover a wide range of engineering tasks, the study still focuses on a subset of engineering design problems. This could limit the applicability of the findings to other challenges encountered in the broad spectrum of the field.
Dependency on Prompt Engineering: The results might be highly sensitive to how prompts are engineered. Subtle variations in prompt structure or wording could lead to markedly different responses from the model, affecting the reliability of the evaluations. As detailed in the Discussion, there are numerous prompting techniques that can be explored to improve results, as well as experiments that vary the quantity and quality of input images. However, we note that our contribution is a benchmark, with released datasets of prompts, images, and outputs, designed to enable future research to compare against our work under consistent conditions.
Dataset Representativeness: The selected benchmark datasets, their quality, diversity, and representativeness can significantly impact the model’s performance and our results. While we created a large set of evaluation problems for VLMs, we recognize that these datasets might not fully capture the diversity and complexity of real-world engineering scenarios. This could affect the generalizability of the results to practical engineering applications.
Black-box and evolving models: Model changes, including data leaks, and the lack of control when using the chat interface, mean that we cannot fully define the experimental environment and some of the results may be different if reevaluated. However, for the assessments, we strived to create larger benchmarks within the limitation of the chat interface and repeated the experiments to obtain a better sample of the model’s performance. Further, since AI models are frequently updated, our results may not hold for long. Yet, while new VLMs will enable new capabilities, we believe that this study provides a lot of value by demonstrating the tasks that future models should be evaluated on as well as providing these tasks in our datasets. We release all our quantitative datasets to measure how much future models improve for different engineering problems.
Human-AI Interaction: An important part of the engineering design process involves how humans interact with designs. In this study, we did not test how human designers might interact with VLMs. Understanding how this interaction influences the problem-solving process is crucial, as human biases, trust, and interpretations can affect the outcomes (Zhang et al. 2023).
In conclusion, while this study offers valuable insights into the capabilities of VLMs in addressing engineering design problems, it is essential to recognize these limitations as an integral part of our findings. They highlight the areas needing further exploration and remind us of the cautious approach required when generalizing AI capabilities to broader real-world applications. Our research is a step in an ongoing journey, contributing to the evolving dialogue on the role and effectiveness of AI in complex, multifaceted fields like engineering design.
Conclusion and future work
The first avenue for future research following this study involves expanding the scope and depth of the engineering problems evaluated. This can be achieved by incorporating a wider variety of engineering challenges, particularly from domains that were less represented in our initial study. We believe that industry can play a crucial role in this step, by providing representative problems for the different types of engineering design tasks they face. Such diversity in problem selection will provide a more comprehensive understanding of VLMs’ capabilities across the engineering design process. Alongside this, there is a need for enhanced dataset curation. Developing more robust datasets that closely mirror complex, real-world engineering scenarios can significantly improve model evaluation. These datasets should capture the multifaceted and multidimensional nature of engineering tasks, allowing for a more nuanced assessment of multimodal VLMs’ applicability and effectiveness. An effort should also be made to avoid publicly available datasets, to limit challenges with evaluation data leaking into future model training.
Another critical area of future work lies in the realm of human-AI collaboration. It is imperative to study how engineers interact with VLMs in real-world design scenarios. Such studies can shed light on practical utility, user trust, and the integration of AI into engineering workflows. This includes understanding how engineers’ biases and decision-making processes interact with AI-generated solutions. Additionally, conducting longitudinal studies to monitor the impact of model evolution over time on its performance in engineering tasks will be highly beneficial. Given the rapid developments in AI, understanding how updates and changes affect its applicability and effectiveness is crucial. This will help in keeping the AI applications in engineering up-to-date and relevant, ensuring that they continue to meet the evolving demands of the field.
Author contributions
C.P. and K.M.E. supervised the projects, assisted all co-authors in developing their sections, and wrote the introduction and the discussion. C.P. developed the spatial reasoning experiments and the evaluation of LLaVA 1.6. K.M.E. developed the conceptual design experiments. A.C.D. developed the system-level and detailed design, the design for additive manufacturing, and the post-manufacturing inspection sections. B.M. developed the textbook problems section. G.G. developed the topology optimization section. M.F.A. developed the subtractive manufacturing section. F.A. provided guidance across all sections and assisted in manuscript preparation and editing. All authors discussed the results and contributed to the final manuscript.
Funding
Open Access funding provided by the MIT Libraries. This study was supported by the Swiss National Science Foundation (Postdoc.Mobility Fellowship P500PT_206937) and a National Science Foundation Graduate Research Fellowship.
Data availability
Data needed for future evaluations, including the input images, input prompts, and answers for all quantitative experiments, is available online: https://doi.org/10.7910/DVN/FLHZQE
Declarations
Competing interests
The authors declare no competing interests.
The majority of the research and experimental work for the initial version of this paper was carried out before November 2, 2023, leveraging GPT-4V released on September 25, 2023.
2. Trial 1 used the September 25th update of GPT-4V; Trials 2 and 3 used the November 6th update of GPT-4V.
3. The engineering drawing of the block with blind hole shown in Fig. 4.2.1 was created as an assignment for the “ENME272: Introduction to Computer Aided Design” course at the University of Maryland.
4. This experiment was conducted using the Nov. 6, 2023 version through the API interface.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Agarwal S, Wills J, Cayton L, Lanckriet G, Kriegman D, Belongie S (2007) Generalized non-metric multidimensional scaling. In: Proceedings of the eleventh international conference on artificial intelligence and statistics. Proceedings of machine learning research, vol 2, pp 11–18, San Juan, Puerto Rico, 21–24 March 2007. https://proceedings.mlr.press/v2/agarwal07a.html
Ahmed, F; Ramachandran, SK; Fuge, M; Hunter, S; Miller, S. Interpreting idea maps: pairwise comparisons reveal what makes ideas novel. J Mech Des; 2018; 141,
Alawwad HA, Alhothali A, Naseem U, Alkhathlan A, Jamal A (2024) Enhancing textbook question answering task with large language models and retrieval augmented generation. SSRN 4761601
Amabile, TM. Social psychology of creativity: a consensual assessment technique. J Pers Soc Psychol; 1982; 43,
Arkoudas K (2023) GPT-4 can’t reason. https://arxiv.org/abs/2308.03762
Ashby, MF. Materials selection in mechanical design; 2016; Amsterdam, Elsevier:
Attaran, M. The rise of 3-D printing: the advantages of additive manufacturing over traditional manufacturing. Bus Horiz; 2017; 60,
Badini, S; Regondi, S; Frontoni, E; Pugliese, R. Assessing the capabilities of chatgpt to improve additive manufacturing troubleshooting. Adv Ind Eng Polym Res; 2023; 6,
Baer J, Kaufman JC (2019) Assessing creativity with the consensual assessment technique. In: The Palgrave handbook of social creativity research. Springer, pp 27–37. https://doi.org/10.1007/978-3-319-95498-1_3
Bavishi R, Elsen E, Hawthorne C, Nye M, Odena A, Somani A, Taşırlar S (2023) Introducing our multimodal models. https://www.adept.ai/blog/fuyu-8b
Bendsøe, MP. Optimal shape design as a material distribution problem. Struct Optim; 1989; 1,
Bendsøe, MP; Kikuchi, N. Generating optimal topologies in structural design using a homogenization method. Comput Methods Appl Mech Eng; 1988; 71,
Berkowitz, M; Stern, E. Which cognitive abilities make the difference? Predicting academic achievements in advanced STEM studies. J Intell; 2018; 6,
Berkowitz, M; Gerber, A; Thurn, CM; Emo, B; Hoelscher, C; Stern, E. Spatial abilities for architecture: cross sectional and longitudinal assessment with novel and existing spatial ability tests. Front Psychol; 2021; 11, 4096. [DOI: https://dx.doi.org/10.3389/fpsyg.2020.609363]
Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss Ariel, Krueger G, Henighan T, Child R, Ramesh A, Ziegler DM, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford Alec, Sutskever I, Amodei Dario (2020) Language models are few-shot learners. In: Proceedings of the 34th international conference on neural information processing systems, NIPS ’20. Curran Associates, Red Hook
Bryant C, Stone R, Mcadams D, Kurtoglu T, Campbell M (2005) Concept generation from the functional basis of design. In: Proceedings ICED 05, the 15th international conference on engineering design, 01
Bubeck S, Chandrasekaran V, Eldan R, Gehrke J, Horvitz E, Kamar E, Lee P, Lee YT, Li Y, Lundberg S, Nori H, Palangi H, Ribeiro MT, Zhang Y (2023) Sparks of artificial general intelligence: early experiments with GPT-4, April 2023. https://arxiv.org/abs/2303.12712
Budinoff HD (2019) Geometric manufacturability analysis for additive manufacturing. PhD thesis, University of California, Berkeley
Budinoff, HD; McMains, S. Will it print: a manufacturability toolbox for 3D printing. Int J Interact Des Manuf; 2021; 15,
Buehler, MJ. MeLM, a generative pretrained language modeling framework that solves forward and inverse mechanics problems. J Mech Phys Solids; 2023; 181, 4658514 [DOI: https://dx.doi.org/10.1016/j.jmps.2023.105454] 105454.
Cai, J; Yuan, Y; Sui, X; Lin, Y; Zhuang, K; Yun, X; Zhang, Q; Ukrainczyk, N; Xie, T. Chatting about chatgpt: how does chatgpt 4.0 perform on the understanding and design of cementitious composite?. Constr Build Mater; 2024; 425, [DOI: https://dx.doi.org/10.1016/j.conbuildmat.2024.135965] 135965.
Cao W, Robinson T, Hua Y, Boussuge F, Colligan AR, Pan W (2020) Graph representation of 3D CAD models for machining feature recognition with deep learning. In: vol 11A: 46th Design automation conference (DAC) of International design engineering technical conferences and computers and information in engineering conference, August 2020. https://doi.org/10.1115/DETC2020-22355
Chang Y, Wang X, Wang J, Wu Y, Zhu K, Chen H, Yang L, Yi X, Wang C, Wang Y, Ye W, Zhang Y, Chang Y, Yu PS, Yang Q, Xie X (2024) A survey on evaluation of large language models. ACM Trans Intell Syst Technol (TIST). Accepted
Chun J-H, Kim S-G (2004) 2.008 Design and manufacturing II, Spring 2004. Massachusetts Institute of Technology: MIT OpenCouseWare. https://ocw.mit.edu/courses/2-008-design-and-manufacturing-ii-spring-2004. Accessed 20 Oct 2023
Corbett, J; Crookall, JR. Design for economic manufacture. CIRP Ann; 1986; 35,
Cseh, GM; Jeffries, KK. A scattered CAT: a critical evaluation of the consensual assessment technique for creativity research. Psychol Aesthet Creat Arts; 2019; 13,
Das, M; Yang, MC. Assessing early stage design sketches and reflections on prototyping. J Mech Des; 2022; 144,
Dong Q, Li L, Dai D, Zheng C, Ma J, Li R, Xia H, Xu J, Wu Z, Chang B, Sun X, Li L, Sui Z (2024) A survey on in-context learning. In: Al-Onaizan Y, Bansal M, Chen Y-N (eds) Proceedings of the 2024 conference on empirical methods in natural language processing, Miami, Florida, USA, November. Association for Computational Linguistics, pp 1107–1128. https://doi.org/10.18653/v1/2024.emnlp-main.64. https://aclanthology.org/2024.emnlp-main.64
Doris AC, Grandi D, Tomich R, Alam MF, Ataei M, Cheong H, Ahmed F (2025) DesignQA: a multimodal benchmark for evaluating large language models’ understanding of engineering documentation. J Comput Inf Sci Eng 25(2):021009
Drela M, Hall S, Lagace PA, Lundqvist IK, Naeser G, Perry H, Radovitzky R, Waitz IA (2005a) Unified engineering I, II, III, & IV (supplementary notes for lectures m17-m20). Massachusetts Institute of Technology: MIT OpenCouseWare. https://ocw.mit.edu/courses/16-01-unified-engineering-i-ii-iii-iv-fall-2005-spring-2006/resources/zm17_20/. Accessed 18 Nov 2023
Drela M, Hall S, Lagace PA, Lundqvist IK, Naeser G, Perry H, Radovitzky R, Waitz IA, Young P, Craig JL (2005b) Unified engineering I, II, III, & IV (lecture notes). Massachusetts Institute of Technology: MIT OpenCouseWare. https://ocw.mit.edu/courses/16-01-unified-engineering-i-ii-iii-iv-fall-2005-spring-2006/resources/zm21/. Accessed 18 Nov 2023
Edwards KM, Tehranchi F, Miller SR, Ahmed F (2025) AI judges in design: statistical perspectives on achieving human expert equivalence with vision-language models. https://arxiv.org/abs/2504.00938
Edwards, KM; Peng, A; Miller, SR; Ahmed, F. If a picture is worth 1000 words, is a word worth 1000 features for design metric estimation?. J Mech Des; 2021; 144,
Elrefaie M, Qian J, Wu R, Chen Q, Dai A, Ahmed F (2025) AI agents in engineering design: a multi-agent framework for aesthetic and aerodynamic car design. https://arxiv.org/abs/2503.23315
Feng TH, Denny P, Wuensche B, Luxton-Reilly A, Hooper S (2024) More than meets the ai: Evaluating the performance of gpt-4 on computer graphics assessment questions. In: Proceedings of the 26th Australasian computing education conference, ACE ’24, New York, NY, USA. Association for Computing Machinery, pp 182–191. https://doi.org/10.1145/3636243.3636263
Frey D, Gossard D (2009) 2.007 Design and manufacturing I, Spring 2009. Massachusetts Institute of Technology: MIT OpenCouseWare. https://ocw.mit.edu/courses/2-007-design-and-manufacturing-i-spring-2009. Accessed 20 Oct 2023
Gao, S; Shah, JJ. Automatic recognition of interacting machining features based on minimal condition subgraph. Comput Aided Des; 1998; 30,
Pahl, G; Beitz, W; Feldhusen, J; Grote, K-H. Engineering design: a systematic approach; 2007; London, Springer: [DOI: https://dx.doi.org/10.1007/978-1-84628-319-2]
Henderson K (1999) On line and on paper: visual representations, visual culture, and computer graphics in design engineering. Inside Technology. MIT Press, Cambridge
Hubs (2023) What are the key design elements for 3D printing? https://www.hubs.com/knowledge-base/key-design-considerations-3d-printing/. Accessed 18 Nov 2023
John B, Sharon SM (2009) Assessing creativity using the consensual assessment technique. In: Handbook of research on assessment technologies, methods, and applications in higher education. IGI Global, pp 65–77. https://doi.org/10.4018/978-1-60566-667-9.ch004
Katz DM, Bommarito MJ, Gao S, Arredondo P (2023) GPT-4 passes the bar exam. SSRN 4389233. https://doi.org/10.2139/ssrn.4389233
Lewis, P; Perez, E; Piktus, A; Petroni, F; Karpukhin, V; Goyal, N; Küttler, H; Lewis, M; Yih, W; Rocktäschel, T et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv Neural Inf Process Syst; 2020; 33, pp. 9459-9474.
Liu F, Guan T, Li Z, Chen L, Yacoob Y, Manocha D, Zhou T (2023a) HallusionBench: You see what you think? or you think what you see? an image-context reasoning benchmark challenging for GPT-4V(ision), LLaVA-1.5, and other multi-modality models. URL https://arxiv.org/abs/2310.14566
Liu H, Li C, Li Y, Lee YJ (2023b) Improved Baselines with Visual Instruction Tuning, 10. https://arxiv.org/abs/2310.03744
Liu Y, Duan H, Zhang Y, Li B, Zhnag S, Zhao W, Yuan Y, Wang J, He C, Liu Z, Chen K, Lin D (2023c) Mmbench: is your multi-modal model an all-around player? arXiv:2307.06281
Liu H, Li C, Li Y, Li B, Zhang Y, Shen S, Lee YJ (2024) Llava-next: improved reasoning, ocr, and world knowledge, January. https://llava-vl.github.io/blog/2024-01-30-llava-next/
Luo M, Xu X, Liu Y, Pasupat P, Kazemi M (2024) In-context learning with retrieved demonstrations for language models: a survey. https://arxiv.org/abs/2401.11624
Maeda Y, Yoon SY (2013) A meta-analysis on gender differences in mental rotation ability measured by the Purdue spatial visualization tests: visualization of rotations (PSVT:R). Educ Psychol Rev 25(1):69–94. https://doi.org/10.1007/s10648-012-9215-x
Makatura L, Foshey M, Wang B, Hähnlein F, Ma P, Deng B, Tjandrasuwita M, Spielberg A, Owens CE, Chen PY, Zhao A, Zhu A, Norton WJ, Gu E, Jacob J, Li Y, Schulz A, Matusik W (2024) Large language models for design and manufacturing. An MIT exploration of generative AI, March 27. https://mit-genai.pubpub.org/pub/nmypmnhs
Mallis, D, Karadeniz AS, Cavada S, Rukhovich D, Foteinopoulou N, Cherenkova K, Kacem A, Aouada D (2025) CAD-assistant: tool-augmented vllms as generic cad task solvers. https://arxiv.org/abs/2412.13810
Manyika J, Hsiao S (2023) An overview of Bard: an early experiment with generative AI. https://ai.google/static/documents/google-about-bard.pdf
Mao, R, Chen G, Zhang X, Guerin F, Cambria E (2023) GPTEval: a survey on assessments of ChatGPT and GPT-4. https://arxiv.org/abs/2308.12488
Miller, SR; Hunter, ST; Starkey, E; Ramachandran, S; Ahmed, F; Fuge, M. How should we measure creativity in engineering design? A comparison between social science and engineering approaches. J Mech Des; 2021; 143,
Mundt M, Majumder S, Murali S, Panetsos P, Ramesh V (2019) Meta-learning convolutional neural architectures for multi-target concrete defect classification with the concrete defect bridge image dataset. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11196–11205
Nandy A, Goucher-Lambert K (2022) Do human and computational evaluations of similarity align? An empirical study of product function. J Mech Des 144(4):041404, 03. https://doi.org/10.1115/1.4053858
Nelson, MD; Goenner, BL; Gale, BK. Utilizing ChatGPT to assist CAD design for microfluidic devices. Lab Chip; 2023; 23,
Nori H, King N, McKinney SM, Carignan D, Horvitz E (2023) Capabilities of GPT-4 on medical challenge problems. https://arxiv.org/abs/2303.13375
Nyemba WR (2022) Computer aided design: engineering design and modeling using AutoCAD. CRC Press, Boca Raton. https://doi.org/10.1201/9781003288626
Ocker F, Menzel S, Sadik A, Rios T (2025) From idea to CAD: a language model-driven multi-agent system for collaborative design. https://arxiv.org/abs/2503.04417
Okudan, GE; Tauhid, S. Concept selection methods—a literature review from 1980 to 2008. Int J Des Eng; 2009; 2,
OpenAI (2023) GPT-4V(ision) system card. Technical report, OpenAI
Picard C (2023) MechE rotation test, 11. ETH, Zurich
Picard C, Edwards KM, Doris AC, Man B, Giannone G, Alam MF, Ahmed F (2024) Data for evaluating vision-language models for engineering design. https://doi.org/10.7910/DVN/FLHZQE
Pugh S (1991) Total design. Addison-Wesley, Boston
Pugh S (1995) Concept selection—a method that works. In: International conference of engineering design, pp 497–506
Quaiser-Pohl C (2003) The mental cutting test “Schnitte” and the picture rotation test: two new measures to assess spatial ability. Int J Test 3
Rosoł M, Gąsior JS, Łaba J, Korzeniewski K, Młyńczak M (2023) Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish medical final examination. Sci Rep 13
Saka A, Taiwo R, Saka N, Salami B, Ajayi S, Akande K, Kazemi H (2023) GPT models in construction industry: opportunities, limitations, and a use case validation. https://arxiv.org/abs/2305.18997
Saravi M, Newnes L, Mileham AR, Goh YM (2008) Estimating cost at the conceptual design stage to optimize design in terms of performance and cost. In: Collaborative product and service life cycle management for a sustainable world. Springer, London, pp 123–130. https://doi.org/10.1007/978-1-84800-972-1_11
Shah JJ, Smith SM, Vargas-Hernandez N (2003) Metrics for measuring ideation effectiveness. Des Stud 24
Shea DL, Lubinski D, Benbow CP (2001) Importance of assessing spatial ability in intellectually talented young adolescents: a 20-year longitudinal study. J Educ Psychol 93
Shi Y, Peng D, Liao W, Lin Z, Chen X, Liu C, Zhang Y, Jin L (2023) Exploring OCR capabilities of GPT-4V(ision): a quantitative and in-depth evaluation. https://arxiv.org/abs/2310.16809
Siddharth L, Blessing L, Luo J (2022) Natural language processing in-and-for design research. Des Sci 8:e21. https://doi.org/10.1017/dsj.2022.16
Sigmund O, Maute K (2013) Topology optimization approaches. Struct Multidiscip Optim 48
Siriwardhana S, Weerasekera R, Wen E, Kaluarachchi T, Rana R, Nanayakkara S (2023) Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering. Trans Assoc Comput Linguist 11:1–17
Song B, Miller S, Ahmed F (2023a) Attention-enhanced multimodal learning for conceptual design evaluations. J Mech Des 145
Song B, Zhou R, Ahmed F (2023b) Multi-modal machine learning in engineering design: a review and future directions. https://arxiv.org/abs/2302.10909
Starkey E, Toh CA, Miller SR (2016) Abandoning creativity: the evolution of creative ideas in engineering design course projects. Des Stud 47:47–72. https://doi.org/10.1016/j.destud.2016.08.003
Stechly K, Marquez M, Kambhampati S (2023) GPT-4 doesn’t know it’s wrong: an analysis of iterative prompting for reasoning problems. https://arxiv.org/abs/2310.12397
Stella F, Della Santina C, Hughes J (2023) How can LLMs transform the robotic design process? Nat Mach Intell 5
Su H, Song B, Ahmed F (2023) Multi-modal machine learning for vehicle rating predictions using image, text, and parametric data. In: Proceedings of the international design engineering technical conferences & computers and information in engineering conference, Boston, MA. ASME
Takagi S, Watari T, Erabi A, Sakaguchi K (2023) Performance of GPT-3.5 and GPT-4 on the Japanese medical licensing examination: comparison study. JMIR Med Educ 9:e48002. https://doi.org/10.2196/48002. URL https://mededu.jmir.org/2023/1/e48002
Taraban R (2011) Information fluency growth through engineering curricula: analysis of students’ text-processing skills and beliefs. J Eng Educ 100
Toh CA, Miller SR (2016) Choosing creativity: the role of individual risk and ambiguity aversion on creative concept selection in engineering design. Res Eng Des 27:195–219. https://doi.org/10.1007/s00163-015-0212-1
Toh C, Miller SR (2019) Does the preferences for creativity scale predict engineering students’ ability to generate and select creative design alternatives? J Mech Des 141
Ulrich KT, Eppinger SD, Yang MC (2020) Product design and development, 7th edn. McGraw-Hill, New York
Verhaegen P-A, Vandevenne D, Duflou JR (2012) Originality and novelty: a different universe. In: International design conference—DESIGN 2012, Dubrovnik, Croatia, 21–24 May 2012, pp 1961–1966
Wang X, Hu Z, Lu P, Zhu Y, Zhang J, Subramaniam S, Loomba AR, Zhang S, Sun Y, Wang W (2023) SciBench: evaluating college-level scientific problem-solving abilities of large language models. https://arxiv.org/abs/2307.10635
Webb C (2008) 45nm design for manufacturing. Intel Technol J 12(2):121–130
Wen L, Yang X, Fu D, Wang X, Cai P, Li X, Ma T, Li Y, Xu L, Shang D, Zhu Z, Sun S, Bai Y, Cai X, Dou M, Hu S, Shi B (2023) On the road with GPT-4V(ision): early explorations of visual-language model on autonomous driving. https://arxiv.org/abs/2311.05332
Woldseth RV, Aage N, Bærentzen JA, Sigmund O (2022) On the use of artificial neural networks in topology optimisation. Struct Multidiscip Optim 65
Xiong G, Jin Q, Lu Z, Zhang A (2024) Benchmarking retrieval-augmented generation for medicine. In: Findings of the Association for Computational Linguistics: ACL 2024, pp 6233–6251
Yan F, Wang L, Li L (2020) Conceptual design scheme automatic generation and decision-making considering green demand. Procedia Manuf 43:407–414. https://doi.org/10.1016/j.promfg.2020.02.194
Yang S, Zhao YF (2015) Additive manufacturing-enabled design theory and methodology: a critical review. Int J Adv Manuf Technol 80:327–342. https://doi.org/10.1007/s00170-015-6994-5
Yang Z, Li L, Lin K, Wang J, Lin C-C, Liu Z, Wang L (2023b) The dawn of LMMs: preliminary explorations with GPT-4V(ision). https://arxiv.org/abs/2309.17421
Yang J, Zhang H, Li F, Zou X, Li C, Gao J (2023a) Set-of-mark prompting unleashes extraordinary visual grounding in GPT-4V. https://arxiv.org/abs/2310.11441
Yoon SY (2011) Psychometric properties of the revised purdue spatial visualization tests: visualization of rotations (the revised PSVT:R). PhD thesis, Purdue University
Yuan C, Marion T, Moghaddam M (2021) Leveraging end-user data for enhanced design concept evaluation: a multimodal deep regression model. J Mech Des 144
Zhang Z, Jaiswal P, Rai R (2018) FeatureNet: machining feature recognition based on 3D convolution neural network. Comput Aided Des 101:12–22. https://doi.org/10.1016/j.cad.2018.03.006
Zhang G, Raina A, Brownell E, Cagan J (2023) Artificial intelligence impersonating a human: the impact of design facilitator identity on human designers. J Mech Des 145
Zheng L, Chiang W-L, Sheng Y, Zhuang S, Wu Z, Zhuang Y, Lin Z, Li Z, Li D, Xing E, Zhang H, Gonzalez JE, Stoica I (2023) Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S (eds) Advances in neural information processing systems, vol 36. Curran Associates, pp 46595–46623. https://proceedings.neurips.cc/paper_files/paper/2023/file/91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf
Zhu K, Zhao Q, Chen H, Wang J, Xie X (2024) PromptBench: a unified library for evaluation of large language models. J Mach Learn Res (JMLR) 25:1–22
Zong M, Krishnamachari B (2023) Solving math word problems concerning systems of equations with GPT models. Mach Learn Appl 14:100506. https://doi.org/10.1016/j.mlwa.2023.100506
© The Author(s) 2025. This work is published under the Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/).