Software development is an intricate and time-consuming process, and resource estimation is one of its most important activities. Estimation models are commonly built on the functional size of the software, since it is currently the only widely accepted size metric; functional size measurement, however, is itself time-consuming. The use of artificial intelligence (AI) to automate software development tasks has gained popularity in recent years, and software functional sizing and estimation is one area where AI may be applied. In this study, we investigate whether ChatGPT 4o, a large language model (LLM), can apply the concepts and rules of the COSMIC method to perform measurements. We found that ChatGPT could not reliably produce accurate results. Its primary shortcomings are its inability to accurately extract data movements, data groups, and functional users from the text. As a consequence, ChatGPT’s measurements fail to meet two essential requirements of measurement: accuracy and reproducibility.
INTRODUCTION
The functionality of software products can be assessed using one of the five standardized functional size measurement methods (FSMM); COSMIC ISO/IEC 19761 is the only second-generation FSMM [1]. By assessing functional size with a standard, the competitive software development industry can address the project estimation problem: several estimation methods have been established on this basis, such as [2], [3], and software development productivity can be measured.
Software estimation has been a focus of research for roughly 70 years, since its inception in the 1950s [4]. During this period, studies have yielded some significant findings: precise estimation is a crucial component of software development and one of the most critical parts of the process, while poor estimation is a key factor in project failure. It also has a significant effect on project planning and industrial budgets [5–7].
The idea of automating software development tasks with artificial intelligence (AI) has gained traction in recent years. Software functional sizing and estimation is one field in which AI aims to demonstrate practical and accurate application [6, 8, 9].
The objective is to shorten the time required for measurement with the standard Functional Size Measurement Methods (FSMM). This would enable businesses to estimate functional sizes quickly, by measuring user requirements, most often provided in text format, accurately and promptly.
ChatGPT is one of the most advanced models of AI technology, offering some amazing and useful solutions in many fields, such as marketing [10], book creation/edition [11, 12], graphic design [13], video creation/edition [14], music edition [15], and so forth. However, not every use has been effective; some attempts have led to pertinent failures or even instances of plagiarism [16].
In this article, we examine without bias whether it is feasible to measure user requirements using ChatGPT 4o, by providing a specific prompt that outlines the fundamentals of the COSMIC method.
This paper’s outline is as follows. Background information on software estimation and measurement, large language models, functional size measurement, and measurement reproducibility is given in Section 2. The experimental protocol and its implementation are explained in Section 3. The data acquired are discussed in Section 4, and the conclusions are presented in Section 5. Appendix 1 contains the functional size measurement of the user requirements using the COSMIC method, and Appendix 2 contains the output from ChatGPT.
BACKGROUND
Software Measurement and Estimation
The literature on software estimation has a wide range of techniques developed over more than six decades [4]. This has resulted in several estimation methods [2, 3, 6, 7], numerous classifications of these methods [5–7], [17–19], and various estimation process topologies [20, 21]. Despite this extensive catalog of techniques, there is still no consensus on a single model that consistently produces accurate results for all industrial projects.
Even though regression-based estimation techniques built on reference databases predominate in the literature, it is not uncommon for the research to be difficult to reproduce [9, 17, 22]. Several authors point out that measuring the size of the software is essential to the precision of estimates [23–26]. According to Fedotova et al. [4], the lack of a size variable may contribute to the poor estimation performance of regression-based models.
Neural networks (NNs) and other machine learning (ML) techniques have proven to be highly effective in producing accurate predictions, even in situations where noise has severely distorted the input data and the relationships between the inputs and outputs are complex [18]. Their capacity to learn and adapt through tweaking parameters and network design is partly responsible for this strength. But NNs have serious drawbacks as well, like not having an explanation facility and having trouble generalizing solutions as circumstances change.
The academic literature points to several difficulties in the subject, chief among them being the scarcity of real-world datasets (such as those from NASA, ISBSG, Desharnais, and COCOMO) [6]. The use of AI techniques is significantly hampered by this lack.
In the reviewed literature, only two approaches using AI to measure functional size were found, both employing the COSMIC standard [8, 27]. Ungan [8] presented a closed-source tool that measures user requirements written as free-form text; to attain a “precise” measure, it requires clear, high-quality specifications. Free-form requirements are by definition prone to being ambiguous, long, and incomplete, especially in the early stages of a project [28].
The other method involves measuring a reference case study using ChatGPT for the first time. This approach yields less than ideal results when applying COSMIC principles and rules; even in cases where the sizes are comparable, there are numerous errors in identifying data groups, data movements, and functional users based on the prompt requirements [27].
The LLM Model: ChatGPT
A Large Language Model (LLM) is an artificial intelligence (AI) model designed to understand and generate human-like texts. LLMs are typically based on deep learning architectures, such as Transformers, and are trained on large amounts of text data to learn the patterns and structures of natural language. These models can perform a variety of language related tasks, including text generation, language translation, question answering, summarization, and more [29].
LLMs have demonstrated remarkable capabilities in “understanding” and generating text across different languages and domains. They are widely used in various applications, such as virtual assistants, chatbots, content generation, language translation services, and natural language processing tasks. Examples of popular LLMs include OpenAI’s GPT series (such as GPT-3, GPT-3.5, and GPT-4) and Google’s BERT.
ChatGPT is an LLM developed by OpenAI, specifically based on the GPT (Generative Pre-trained Transformer) architecture. It uses artificial intelligence to generate responses in text conversations. ChatGPT is designed to “understand” and generate coherent text across many topics and contexts, using machine learning to improve responsiveness and understanding [16].
The functioning of LLMs is based on two main phases: training and fine-tuning.
Training: In this phase, the model is trained on large volumes of text, learning to predict the next word in each sequence. This process is conducted in an unsupervised manner, meaning the model does not need specific labels to learn. Through millions of iterations, the model adjusts its internal weights to minimize the difference between its predictions and the actual words in the training data.
Fine-tuning: Once the model is trained, it is refined with more specific and labeled data for particular tasks, such as answering questions, translating text, or summarizing content. This phase allows the model to tailor its capabilities to more concrete applications and improve its performance on specific tasks.
Text generation in LLMs, like ChatGPT, is based on the model’s ability to predict the next word in a text sequence. When given an input, the model evaluates the previous words and generates a list of possible next words and their associated probabilities. The word with the highest probability is selected, and the process repeats until the response is complete.
The model uses attention layers to determine which parts of the previous text are most relevant for predicting the next word. These attention layers enable the model to capture long-range dependencies and complex relationships between words, thus enhancing the coherence and relevance of the generated text [16].
Considering the above, LLMs can perform a form of reasoning based on statistical and contextual patterns learned during training. The models do not have understanding or awareness but operate based on correlations and patterns in the training data. For example, they can answer questions, solve simple math problems, and follow complex instructions by recognizing patterns in the text. However, their reasoning is limited to the information available in their training data and their ability to manipulate this data within their statistical model.
Measurement of the Functional Size of Software Using COSMIC
Functional size measurement methods (FSMM) are currently divided into two generations [1], with COSMIC ISO/IEC 19761 [28] being the only second-generation FSMM in use. The ISO/IEC 14143 standard and the lessons learned from first-generation methodologies were the foundation for developing this standard [30]. As a result, it tackles numerous problems raised by these earlier approaches, including concepts that were pertinent when the first-generation methods were developed but are now out of date, unworkable measurement scales, and a narrow field of application. The COSMIC Measurement Manual [28] presents all the guidelines, principles, and examples required to carry out functional size measurements with the COSMIC method.
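At its core, the COSMIC counting rule assigns one COSMIC Function Point (CFP) to each data movement of a functional process, where every movement is an Entry, Exit, Read, or Write. A minimal sketch, with illustrative movements that are not taken from the Measurement Manual:

```python
from collections import Counter

# Illustrative data movements for a hypothetical "Add teacher" process.
movements = [
    ("Teacher data", "Entry"),   # received from a functional user
    ("Teacher data", "Write"),   # persisted to storage
    ("Confirmation", "Exit"),    # sent back to the functional user
    ("Error message", "Exit"),
]

VALID_KINDS = {"Entry", "Exit", "Read", "Write"}
assert all(kind in VALID_KINDS for _, kind in movements)

size_cfp = len(movements)                        # 1 CFP per data movement
by_kind = Counter(kind for _, kind in movements)
print(size_cfp, dict(by_kind))                   # 4 CFP in total
```

The hard part of the method is not this counting step but correctly identifying the data groups, functional users, and movements in the first place, which is exactly where ChatGPT proves unreliable in the experiment below.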
In real-world projects, approximating a functional size can be necessary in several situations [31]. These include: (1) when a size is required but not enough time or resources are available to measure using the standard method; (2) early in the project’s life cycle, before the functional user requirements (FUR) have been detailed to the point where an accurate size measurement is possible; and (3) when the documentation quality of the actual requirements is inadequate for an accurate measurement.
The COSMIC method has been used to build several approximation techniques. These have all been formal research projects that have produced papers and comprehensive data on the design and functionality of the approach as well as quality evaluation criteria derived from experiments for analysis and comparison. Since these techniques are general, they need to be examined or adjusted using particular data. The Early Software Sizing with COSMIC: Experts Guide [32] contains the only methods that have been thoroughly reviewed.
Measurers should always attempt to obtain as much information and detail as possible about the actual requirements, whether measuring exactly or approximately. The functional size assessment can then be made as accurate as feasible by using assumptions [32].
Reproducibility Importance in Metrology
In any scientific discipline, the validation of results is indispensable. Reproducibility allows other researchers to verify the findings of a study by replicating the same experiment or measurement under the same conditions. If the results can be reproduced, it reinforces the credibility and validity of the original work. This is particularly important in metrology, where the precision and accuracy of measurements directly impact the quality of technological products and services.
Reproducibility also promotes transparency in research: if the proposed or used metrics are reproducible, other scientists can follow the same steps and arrive at similar conclusions, fostering trust in the obtained results. In the context of software engineering, this could involve the data used in the experiments. The transparency thus achieved is crucial for the advancement of the discipline, since software engineering is still considered immature [1], [33].
EXPERIMENTAL PROCEDURE
To conduct this experiment, we utilized version 4o of ChatGPT (dated 08-08-2024) and the user requirements information from the C-Reg case study [34]. This case study provides the functional size measurement of the requirements for an actual system using the COSMIC method. For reference, the functional size of the C-Reg system is detailed in Appendix 1.
Figure 1 illustrates the flow of the experimental procedure to obtain the results presented in Table 1. The first step involved creating a fine-tuned prompt based on the recommendations proposed by [16], which included incorporating some knowledge about COSMIC to be considered by ChatGPT.
Fig. 1. [Images not available. See PDF.]
Flow of experimental procedure diagram.
Table 1. Comparison of the results of applying the COSMIC method to the C-Reg [34] requirements against the results obtained by ChatGPT in two different executions
| ID | Functional process | Measured CFP | Measured CFP (ChatGPT) | Diff | MRE |
|---|---|---|---|---|---|
| 1 | Add teacher details (Prompt1) | 4 | 7 | 3 | 75.0% |
| 1 | Add teacher details (Prompt2) | 4 | 10 | 6 | 150.0% |
| 5 | Consult the Course Offerings (Teacher) (Prompt1) | 7 | 6 | 1 | 14.2% |
| 5 | Consult the Course Offerings (Teacher) (Prompt2) | 7 | 10 | 3 | 42.8% |
| 15 | Modify student schedule (Prompt1) | 8 | 10 | 2 | 25.0% |
| 15 | Modify student schedule (Prompt2) | 8 | 6 | 2 | 25.0% |
The second step was to insert the FUR of three functional processes described in the C-Reg system into the prompt and to execute it with ChatGPT twice for each process.
Finally, we compared the sizes measured with ChatGPT against the real measurements obtained in [34].
Fine-Tuned Prompt Creation
To generate the prompt, it was necessary to describe some aspects of the COSMIC measurement method [28]. Firstly, we defined data groups, data movements, and the definition of functional users.
Creating and fine-tuning the ChatGPT prompt to build an effective functional size measurer involves carefully guiding the model to categorize objects of interest, functional users, and the types of data movements [16].
Firstly, we ask it to describe the given use case and to include the data groups and their movements according to the COSMIC measurement method [28].
Next, we describe and give examples of what a data movement is, what a functional user is, and what an object of interest is. Examples of these concepts improved how ChatGPT classified the content of the use case.
Next, we give hints about which systems mentioned in the requirements are beyond the measurement scope and must be considered functional users (the Course Catalog system and the Billing system).
Then there is a space where the use case to be measured is inserted.
Finally, we ask that the data groups used and their movements be explicitly included according to the COSMIC standard, that the number of times the data groups are moved in the functional process be counted, and that the results be placed in a table.
Below is the prompt that was used to conduct the tests:
Describe the given case use and explicitly include the data groups used and their movements according to the COSMIC standard, the data movements move a group of data and can be read from the database, writing to the database, input from a functional user and output to a functional user, functional users are everything with which the system interacts (e.g. people who use the system, other systems with which it communicates, different systems from which it receives data), data groups describe an object of interest that can be a real world object or a conceptual object (e.g. user, payroll, catalogs, teachers, students, courses, workers).
The Course Catalog system and the Billing System are functional users, so to interact with them, there must be exit and entry movements with these.
Given the following use case:
[Insert a Functional Process from C-reg (Appendix 1)]
Explicitly include the data groups used and their movements according to the COSMIC standard, the movements move a group of data and can be read from the database, written to the database, entry from a functional user, and output to a functional user (e.g. people who use the system, other systems with which it communicates), data groups describe an object of interest that can be a real-world object or a conceptual object (e.g. user, payroll).
Additionally, count the times the data groups are moved in the functional process described above and place them in a table with the form: data group, movement, value.
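The construction of the prompt can be sketched as assembling fixed instruction fragments around the inserted use case. The fragment strings below are abbreviated stand-ins, not the verbatim wording used in the experiment.

```python
# Abbreviated stand-ins for the prompt fragments described above.
COSMIC_RULES = ("Describe the given use case and explicitly include the data "
                "groups used and their movements according to the COSMIC "
                "standard (Entry, Exit, Read, Write)...")
SCOPE_HINTS = ("The Course Catalog system and the Billing System are "
               "functional users, so there must be exit and entry movements "
               "with these.")
OUTPUT_SPEC = ("Count the times the data groups are moved and place them in "
               "a table with the form: data group, movement, value.")

def build_prompt(use_case: str) -> str:
    """Assemble the measurement prompt for one functional process."""
    return "\n\n".join([
        COSMIC_RULES,
        SCOPE_HINTS,
        "Given the following use case:\n" + use_case,
        OUTPUT_SPEC,
    ])
```

Templating the prompt this way makes the only variable part the use case itself, which is what allows the same instructions to be reused across the three functional processes.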
Prompts Execution
Once the distinct prompts had been executed, the results were collected and are shown in Table 1. The functional process ID is shown in the first column, and the name of the selected functional process in the second. The functional size derived from the COSMIC method, as per Appendix 1, is shown in column three. Columns four through six show, for each prompt execution, the functional size obtained with the developed prompt, its difference from the COSMIC measurement, and the Magnitude of Relative Error (MRE).
Using the information in Table 1, the quality criteria for estimation were evaluated to analyze the robustness of the model: the mean magnitude of relative error (MMRE), the MRE standard deviation (SDRMS), and the prediction level at 10% (Pred(10%)).
| Criterion | Value |
|---|---|
| MMRE | 0.553 |
| RMSE | 3.240 |
| SDRMS | 0.510 |
| Pred(10%) | 0 |
Based on the quality criteria, there is a mean relative error of 55.3%, with a standard deviation of 0.510, and none of the ChatGPT measurements fall within the 10% prediction level.
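These criteria can be recomputed from the CFP columns of Table 1; the sketch below reproduces the reported values up to rounding (SDRMS is taken here as the sample standard deviation of the MREs, an assumption on our part).

```python
import math

# (actual CFP, ChatGPT CFP) pairs from Table 1, one per prompt execution.
pairs = [(4, 7), (4, 10), (7, 6), (7, 10), (8, 10), (8, 6)]

mre = [abs(est - act) / act for act, est in pairs]   # relative errors
mmre = sum(mre) / len(mre)                           # mean MRE ≈ 0.553
rmse = math.sqrt(sum((est - act) ** 2 for act, est in pairs) / len(pairs))
sdrms = math.sqrt(sum((x - mmre) ** 2 for x in mre) / (len(mre) - 1))
pred10 = sum(x <= 0.10 for x in mre) / len(mre)      # share of MREs ≤ 10%

print(round(mmre, 2), round(rmse, 2), round(sdrms, 2), pred10)
# → 0.55 3.24 0.51 0.0
```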
The results show that the size measured using ChatGPT 4o has a difference greater than 10% from the real size measured with the COSMIC method in every FP.
Since the COSMIC method is a standard, a functional size measure is expected to be reproducible and auditable; any difference could put the success of the project at risk, since the necessary resources would be estimated differently.
DISCUSSION
From Table 1 it can be observed that, like the findings in the article by Hartenstein et al. [27], ChatGPT exhibits some consistency in the total measurement value, although in this experiment (covering only three functional processes) there is still variation. At the individual level, however, the functional processes yield different results, and functional process 1 shows a significant percentage variation. This observation suggests that there is no reproducibility even when judged solely by the measurement value.
Moreover, after carefully examining ChatGPT’s responses (Appendix 2), we can see that, even where the definitions were given in the prompt, it is inconsistent in identifying data groups, data movements, and functional users directly from the text, elements that software measurers are familiar with. Refer to Tables 2 and 3.
Table 2. ChatGPT functional size response 1 for the Add teacher details functional process. Functional size = 7 CFP

| Data Group | Movement | Value |
|---|---|---|
| Command | Entry | 1 |
| Form Template | Exit | 1 |
| Teacher Data | Entry | 1 |
| Teacher Data | Read | 1 |
| Error Messages | Exit | 2 |
| Teacher Data | Write | 1 |
Table 3. ChatGPT functional size response 2 for the Add teacher details functional process. Functional size = 10 CFP

| Data Group | Movement | Value |
|---|---|---|
| Command | Entry | 1 |
| Form Template | Exit | 1 |
| Teacher Data | Entry | 1 |
| Teacher Data | Read | 1 |
| Error Messages | Exit | 2 |
| Teacher Data | Write | 1 |
| Work Area Data | Write | 1 |
| Course Catalog Data | Exit | 1 |
| Billing Data | Exit | 1 |
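The divergence between the two responses can be made explicit by diffing Tables 2 and 3 at the data-movement level: the totals match the table captions (7 and 10 CFP), and the second response adds three movements absent from the first.

```python
# Rows of Tables 2 and 3 as {(data group, movement): count} mappings.
resp1 = {
    ("Command", "Entry"): 1,
    ("Form Template", "Exit"): 1,
    ("Teacher Data", "Entry"): 1,
    ("Teacher Data", "Read"): 1,
    ("Error Messages", "Exit"): 2,
    ("Teacher Data", "Write"): 1,
}
resp2 = dict(resp1)
resp2.update({
    ("Work Area Data", "Write"): 1,
    ("Course Catalog Data", "Exit"): 1,
    ("Billing Data", "Exit"): 1,
})

size1, size2 = sum(resp1.values()), sum(resp2.values())   # 7 and 10 CFP
extra = sorted(k for k in resp2 if k not in resp1)        # only in response 2
print(size1, size2, extra)
```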
We did not measure all the functional user requirements (FURs) contained in the C-Reg system documentation because it was not feasible to obtain replicable results when applying the COSMIC principles and rules using ChatGPT. Examples of the functional processes analyzed are provided in Appendix 2, each with two responses provided by the LLM, which at the detailed level exhibit significant inconsistencies.
One of the primary reasons for using a standard metric is to enable auditability of results. However, in this case, even though the results could be analyzed for accuracy (quality criteria like MMRE, STDEV, etc.), they cannot be audited due to the varying elements used to derive the size. Therefore, it is not possible to consider these results as measurements. At best, they could be considered an approximation approach with some considerations.
As an approximation approach, it has not been studied as extensively as other methods. The results do not suggest a clear route for refinement or improvement; they seem more like guesses or luck. This highlights the need for further research and development to address the inherent challenges of using natural language processing models for software measurement and estimation.
From the observations made in this experiment, it becomes apparent that any approximation approach based on text may encounter similar challenges. This is easy to understand because LLMs like ChatGPT can only perform a form of reasoning based on statistical and contextual patterns learned during training. This implies that there should be identifiable patterns or repeated elements in a text that the model can recognize and utilize. However, this is an open question in the software requirements research field, predating the existence of LLMs or natural language processing (NLP) technology.
Additionally, numerous subjective elements such as different local expressions, language variations, communication styles, abstractions, etc., could make this a more difficult task. These factors contribute to the complexity of accurately interpreting and analyzing text-based data, making it challenging to develop robust approximation approaches in software requirements. Therefore, addressing these challenges will require interdisciplinary efforts and innovative solutions to advance the state of the art in software measurement and estimation.
CONCLUSIONS
In this study, we proposed an experiment to determine whether ChatGPT could perform COSMIC measurements. However, we discovered that ChatGPT could not reliably produce accurate findings at the detailed level. The primary shortcomings found in ChatGPT include its incapacity to accurately extract data movements, data groups, and functional users from the text.
As we can see, the ChatGPT system failed to apply the COSMIC methodology to a functional requirement, so the results are not necessarily correct. In more complex cases, the results may significantly differ from the correct functional size.
From this experiment, we observe that the results produced by the ChatGPT model are not consistent (reproducible), leading to different results across executions. This inconsistency generates erroneous measurements and incorrect information, undermining a principal element of metrology, and could put the success of the project at risk.
We can observe that ChatGPT does not adhere to the COSMIC methodology, so the resulting measurements, although correct in some cases, are merely coincidental. If the measurement were audited, it would likely not pass the COSMIC method application, which is a significant issue in software contracting.
The inconsistencies could be due to the function of an LLM, such as ChatGPT, which simulates human language and is based on the probabilities of the next word in a text sequence. Therefore, it cannot understand the steps to perform a measurement using a standardized method like COSMIC.
Based on the experiment, it is challenging to conceive that LLM models could accurately measure software using a Functional Size Measurement Method (FSMM) because they operate on patterns and structured data. In contrast, FURs are often described in free text and depend on the individual writer, leading to variations in language, communication styles, and abstractions.
One of the challenges with LLMs is that their knowledge base may become outdated over time. Retrieval-augmented generation (RAG), which combines LLMs with information retrieval systems to enhance the accuracy and relevance of generated text, could be helpful in improving the results.
Indeed, our challenge extends beyond the capabilities of current technology like ChatGPT; it includes the inherent variability and subjectivity of human language. Despite the significant advancements in natural language processing, there is no replicable or consistent way to interpret functional user requirements (FURs) due to the lack of standardized descriptions.
The proposal by Gérançon et al. [35] offers a potential approach to address this issue. However, current tools, even the most advanced ones like ChatGPT, fall short in meeting the essential requirements for measurement using the COSMIC method directly from the FURs text descriptions: accuracy and reproducibility.
This underscores the need for continued research and development efforts to tackle the fundamental challenges posed by human language interpretation in software measurement and estimation until we can first establish standardized and consistent ways of describing requirements, achieving accurate and reproducible measurements instead of guessing a functional size.
Therefore, it’s crucial to recognize that the limitations and issues faced in using AI for estimation tasks extend to approximating functional size from text. Addressing these challenges requires advancements in AI technology, as well as a deeper understanding of the complexities of human language and the specific requirements of the software engineering domain.
LIMITATIONS
Since OpenAI’s ChatGPT model is not open source, and OpenAI can change the way the model responds to mitigate risky results according to its policy [29], the quality of the results in this proposal has been changing from the beginning of this work until the time of its publication (a reproducibility problem).
The steps to perform a functional size measurement using the COSMIC method require knowing the context of the system to be measured, such as the attributes of an object of interest. Thus, the way an LLM will group the attributes of an object of interest could be different from those identified in the measurement process.
How user requirements are obtained can vary, so it is a challenge to create a prompt that could cover everything.
FUTURE WORK
Replicating the experiment with all the functional processes from the case study [34], along with additional real software applications, would provide valuable insights into the performance and limitations of ChatGPT in measuring functional size from text. This expanded experimentation would allow a more comprehensive assessment of the model’s capabilities across different scenarios and contexts.
Using different or improved versions of LLMs, such as ChatGPT 4, 4+, 4o, LLaMA 3.1, Gemini, and Bard, could offer further insights into the model’s consistency and performance. Obtaining consistent results under the same prompt in any model would be crucial to determining its reliability.
Finally, exploring the development of a system that leverages ChatGPT for specific parts of the measurement methodology, such as transforming user stories into functional user actions or wrapping attributes in objects of interest, could lead to creating a more accurate measurement model. A mixed approach, similar to the retrieval-augmented generation (RAG) approach, could combine the strengths of ChatGPT’s language processing capabilities with other techniques or models to enhance accuracy and reliability in functional size measurement.
FUNDING
This work was supported by ongoing institutional funding. No additional grants to carry out or direct this particular research were obtained.
CONFLICT OF INTEREST
The authors of this work declare that they have no conflicts of interest.
Publisher’s Note.
Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
AI tools may have been used in the translation or editing of this article.
REFERENCES
1 Abran, A. Software Metrics and Software Metrology; 2010; [DOI: https://dx.doi.org/10.1002/9780470606834.ch2]
2 Silhavy, R.; Prokopova, Z.; Silhavy, P. Algorithmic optimization method for effort estimation. Program. Comput. Software; 2016; 42, pp. 161-166.
3 Durán, M.; Juárez-Ramírez, R.; Jiménez, S. User story estimation based on the complexity decomposition using Bayesian networks. Program. Comput. Software; 2020; 46, pp. 569-583. [DOI: https://dx.doi.org/10.1134/S0361768820080095]
4 Fedotova, O.; Teixeira, L.; Alvelos, A.H. Software effort estimation with multiple linear regression: Review and practical application. J. Inf. Sci. Eng.; 2013; 29, pp. 925-945.
5 Lee, T.K., Wei, K.T., and Ghani, A.A.A., Systematic literature review on effort estimation for Open Sources (OSS) web application development, Proc. IEEE Future Technologies Conf., FTC 2016, San Francisco, CA, 2016, pp. 1158–1167. https://doi.org/10.1109/FTC.2016.7821748.
6 Sharma, P. and Singh, J., Systematic literature review on software effort estimation using machine learning approaches, Proc. IEEE Int. Conf. on Next Generation Computing and Information Systems (ICNGCIS 2017), Jammu, 2017, pp. 54–57. https://doi.org/10.1109/ICNGCIS.2017.33.
7 Carbonera, C.E.; Farias, K.; Bischoff, V. Software development effort estimation: A systematic mapping study. IET Res. J.; 2020; 14, pp. 1-14. [DOI: https://dx.doi.org/10.1049/iet-sen.2018.5334]
8 Ungan, E., Hammond, C., and Abran, A., Automated COSMIC measurement and requirement quality improvement through ScopeMaster® tool, in Proc. Acad. Pap. IWSM Mensura 2018 “COSMIC Functional Points – Fundamental Software Effort Estimation Held Conjunction with China Software Cost Measurement (CSCM 2018), CEUR Workshop Proc. (CEURWS.org), Murat Salmanoglu, A.C., Ed., Beijing, 2018, pp. 1–13.
9 Braga, P.L., Oliveira, A.L.I., and Meira, S.R.L., Software effort estimation using machine learning techniques with robust confidence intervals, Proc. 7th Int. Conf. on Hybrid Intelligent Systems, Kaiserslautern, 2007.
10 Zhang, Y. and Prebensen, N.K., Co-creating with ChatGPT for tourism marketing materials, Ann. Tourism Res. Empirical Insights, 2024, vol. 5, no. 1, p. 100124. https://doi.org/10.1016/j.annale.2024.100124
11 Altmäe, S., Sola-Leyva, A., and Salumets, A., Artificial intelligence in scientific writing: A friend or a foe?, Reprod. Biomed. Online, 2023, vol. 47, no. 1. https://doi.org/10.1016/j.rbmo.2023.04.009
12 Zuckerman, M.; Flood, R.; Tan, R.J.B.; Kelp, N.; Ecker, D.J.; Menke, J.; Lockspeiser, T. ChatGPT for assessment writing. Med. Teach.; 2023; 45, pp. 1224-1227. [DOI: https://dx.doi.org/10.1080/0142159X.2023.2249239]
13 Putjorn, T. and Putjorn, P., Augmented imagination: Exploring generative AI from the perspectives of young learners, Proc. 15th Int. Conf. on Information Technology and Electrical Engineering (ICITEE), Chiang Mai, 2023, pp. 353–358. https://doi.org/10.1109/ICITEE59582.2023.10317680
14 Bengesi, S., et al., Advancements in generative AI: A comprehensive review of GANs, GPT, autoencoders, diffusion model, and transformers, 2023. arXiv:2311.10242
15 What Is ChatGPT, DALL-E, and Generative AI?, McKinsey & Co., 2023.
16 OpenAI, Achiam, J., Adler, S., and Agarwal, S., GPT-4 technical report, 2024. arXiv:2303.08774
17 Jørgensen, M.; Shepperd, M. A systematic review of software development cost estimation studies. IEEE Trans. Software Eng.; 2007; 33, pp. 33-53. [DOI: https://dx.doi.org/10.1109/TSE.2007.256943]
18 Bilgaiyan, S.; Sagnika, S.; Mishra, S.; Das, M. A systematic review on software cost estimation in agile software development. J. Eng. Sci. Technol. Rev.; 2017; 10, pp. 51-64. [DOI: https://dx.doi.org/10.25103/jestr.104.08]
19 Kinoshita, N., Monden, A., Tshunoda, M., and Yucel, Z., Predictability classification for software effort estimation, Proc. 3rd IEEE/ACIS Int. Conf. on Big Data, Cloud Computing, Data Science and Engineering, BCD 2018, Yonago, 2018.
20 Britto, R., Freitas, V., Mendes, E., and Usman, M., Effort estimation in global software development: A systematic literature review, Proc. 9th IEEE Int. Conf. on Global Software Engineering, ICGSE 2014, Shanghai, 2014, pp. 135–144. https://doi.org/10.1109/ICGSE.2014.11.
21 Valdés-Souto, F., Validation of supplier estimates using cosmic method, Proc. CEUR Int. Workshop on Software Measurement and Int. Conf. on Software Process and Product Measurement (IWSM Mensura 2019), Haarlem, 2019, pp. 15–30.
22 Shin, M.; Goel, A.L. Empirical data modeling in software engineering using radial basis functions. IEEE Trans. Software Eng.; 2000; 26, pp. 567-576. [DOI: https://dx.doi.org/10.1109/32.852743]
23 Linda, M.; Laird, M.C.B. Software Measurement and Estimation: A Practical Approach; 2006; New York, Wiley:
24 Koch, S.; Mitlöhner, J. Software project effort estimation with voting rules. Decision Support Syst.; 2009; 46, pp. 895-901. [DOI: https://dx.doi.org/10.1016/j.dss.2008.12.002]
25 De Lucia, A.; Pompella, E.; Stefanucci, S. Assessing effort estimation models for corrective maintenance through empirical studies. Inf. Software Technol.; 2005; 47, pp. 3-15. [DOI: https://dx.doi.org/10.1016/j.infsof.2004.05.002]
26 Hill, J.; Thomas, L.C.; Allen, D.E. Experts’ estimates of task durations in software development projects. Int. J. Project Manag.; 2000; 18, pp. 13-21. [DOI: https://dx.doi.org/10.1016/S0263-7863(98)00062-3]
27 Hartenstein, S., Johnson, S.L., and Schmietendorf, A., Towards a fast cost estimation supported by large language models, 2024. https://cosmic-sizing.org/publications/fast-cost-estimation-by-chatgpt/.
28 The COSMIC Functional Size Measurement Method: Measurement Manual, v. 5.0 ed., 2021. https://cosmic-sizing.org/measurement-manual/.
29 OpenAI, Achiam, J., Adler, S., and Agarwal, S., GPT-4 System Card, 2024. arXiv:2303.08774
30 Vogelezang, F.; Heeringen, H.V. Benchmarking: Comparing Apples to Apples; 2019; Berkeley, CA, Apress:
31 Vogelezang, COSMIC Group, Early Software Sizing with COSMIC, Practitioners, v.4.0.2, 2020. https://cosmic-sizing.org/publications/early-software-sizing-with-cosmic-practitioners-guide/.
32 Vogelezang, COSMIC Group, Early Software Sizing with COSMIC: Experts Guide, v.4.0.2, 2020. https://cosmic-sizing.org/publications/early-software-sizing-with-cosmic-experts-guide/.
33 Sánchez Alonso, S., Sicilia Urban, M.Á., and Rodríguez García, D., Ingeniería del software: Un enfoque desde la guía SWEBOK, 1a ed., Garceta, 2011.
34 Symons, C.R., et al., Course Registration (‘C-REG’) System Case Study, v2.0.1, 2018. https://cosmic-sizing.org/publications/course-registration-c-reg-system-case-study-v2-0-1/.
35 Gérançon, B., Trudel, S., Kkambou, R., and Robert, S., Software functional sizing automation from requirements written as triplets, Proc. 16th Int. Conf. on Software Engineering Advances, ICSEA 2021, Barcelona, 2021.