The implementation phase is one of the most critical periods in software development. Developers either write new source code or reuse existing source code functionalities to meet the requirements of the system. Most developers spend more time searching and navigating old source code than developing new code. It is therefore essential to have an efficient method for searching source code functionality within a short period. Topic modeling of source code is an approach used to extract topics from source code. Many topic modeling approaches have been implemented using statistical techniques, which have many setbacks: their results rely on non-formal code elements such as identifier names and comments. Our novel approach is implemented using a machine learning algorithm to address these issues, so that the source code functionality results depend only on the algorithm, i.e., the syntax of the source code. Three Java project functionalities, namely prime number, Fibonacci number, and selection sort, were evaluated in this study. The Java parser library is used to derive the source code elements, and an algorithm is created to build the count matrix of the source code features. The dataset was then fed to three models: Artificial Neural Network (ANN), Random Forest (RF), and an Ensemble Approach. It was found that the Ensemble Approach achieved 96.7% accuracy, surpassing ANN and RF.
Keywords: Artificial Neural Network (ANN), Random Forest (RF), Ensemble approach, Source code, Classification, Programming keywords
1. Introduction
Source code is the core component of a software program. It is easily readable and understandable by a human being. Each programming language has its own syntax, which provides the rules governing how the language's characters, keywords, and constructs may be arranged. The semantics of a programming language are practically impossible to comprehend without its syntax. Many programmers spend considerable time examining or changing existing code (Corley et al., 2012). Much of the time, comprehending the current source code is required before changing it, and a significant amount of effort can be wasted just attempting to figure out the existing system. This time may be better spent implementing essential functionality, improving source code quality, or doing more productive things. As a result, it is crucial to make source code searching as simple as possible.
Source code evaluation is an essential preparatory stage in many software engineering fields (Bi et al., 2018), necessary to accomplish operations such as software maintenance, program modification, and source code syntax extraction (Singh et al., 2013). In the software industry, reusability is a trending topic; it is seen as a potential method of increasing software development quality and productivity. Program requirements, documentation, and source code functionalities are examples of software modules created during development that may be reused afterwards.
The source code of software systems frequently comprises code blocks with a significant level of resemblance due to reusability (Nguyen et al., 2016). This might lead to source code updates and fixes being made repeatedly. Finding and integrating reusable source code into new projects is typically less expensive than creating a software system from scratch (Banker et al., 1993). This stresses the importance of efficiently locating and reusing program source code from legacy systems in new projects. Large software development projects have many lines of code, and developers are unable to read the entire code during maintenance (Lakhotia, 1993). Developers must learn to identify source code elements, including class names, packages, and procedures. Consequently, they will be able to identify relevant tasks and use the attributes as required (Haiduc et al., 2010).
Reading merely the classes, method names, and identifiers may not provide enough information about the functionality and purpose of the source code, while reading the whole source code implementation will take quite a long time. Some programmers use comments as a preferred technique for code maintenance while designing source code. However, looking through the entire software and its comments in complex software systems would be difficult and time-consuming.
Software programmers' work would be easier if source code topics or function names could be predicted without human involvement. With the advent of technology, it is now possible to use machine learning techniques to develop a solution for the challenges mentioned above: a strategy that reliably forecasts source code topics by examining the complete source code. Machine Learning (ML) is an artificial intelligence application that allows software programs to predict better results utilizing related data. Essentially, machine learning analyzes input data and creates an algorithmic model that uses statistical hypotheses to determine the output.
Furthermore, with the rapid expansion of component repositories, retrieving relevant components has become problematic. The source code of a programming language is, by itself, only an unstructured string, so we need a way to extract programming language syntax features. A parser library is often used in such cases: it provides an Abstract Syntax Tree (AST) of the source code. A syntax tree of source code represents, for example, an invocation statement with a reference to the invoked component as its destination. ASTs act as data structures for software artefact models that describe the kinds of language constructs, their compositional links to other constructs, and the collection of primary and secondary attributes for each language structure. The AST is formed from the analysis of software artefacts and offers a means of representing such objects.
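As a minimal illustration of this idea, the sketch below parses a Java snippet into an AST and counts a few construct types. The paper uses the Java-based JavaParser library; here, javalang (a Python parser for Java source) is used purely as a stand-in, since the models later in the paper are built in Python.

```python
# Minimal AST sketch. The paper uses the Java-based JavaParser library;
# javalang, a Python parser for Java source, is used here only as an
# illustrative stand-in for the same idea.
import javalang

source = """
public class Demo {
    public boolean isPrime(int n) {
        if (n < 2) return false;
        for (int i = 2; i * i <= n; i++) {
            if (n % i == 0) return false;
        }
        return true;
    }
}
"""

tree = javalang.parse.parse(source)  # AST root (a CompilationUnit)

# filter() walks the tree and yields every node of the requested type.
if_count = sum(1 for _ in tree.filter(javalang.tree.IfStatement))
for_count = sum(1 for _ in tree.filter(javalang.tree.ForStatement))
print(if_count, for_count)  # 2 1
```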
Topic modeling is an important study area in machine learning and knowledge discovery. From a programming perspective, the topic modeling method determines the semantic structure of source code. Lately, source code has been analyzed and modeled using topic modeling approaches (Mahmoud and Bradshaw, 2017). These methods use the textual information of source code to provide autonomous help for several common software development tasks. Despite these advancements, topic modeling approaches in software development often perform poorly, because state-of-the-art topic modeling configurations require a lot of data (Poursabzi-Sangdeh et al., 2021). Using various text mining approaches, the importance of model interpretability can be investigated.
Most of the topic modeling approaches used for source code analysis are statistical techniques, so they cannot provide logical results. Moreover, the performance issues, operational overhead, and multi-parameter tuning frequently connected with traditional topic modeling approaches raise serious doubts regarding their suitability as source code analysis models in software development.
Ensemble methods, an important research topic, try to combine the predictions of different learning algorithms to establish a comprehensive classification model that substantially improves predictive accuracy. Classification algorithms can benefit from ensemble learning to improve their prediction accuracy. This drives studies to seek retrieval approaches that provide effective help for discovering and selecting relevant components of a source code that respond to user queries. Extracting semantic functionality names from source code is a program comprehension exercise that can help a developer search a software system's functionality with greater confidence.
Various examinations of source code have already been undertaken to assess complexity and other variables. In this research, we investigated the performance of Java parser-based feature extraction in source code categorization. There are some tasks that computers can perform better than humans, but when it comes to rational thinking, inspiration, and inventiveness, our brains still win. Artificial Neural Networks (ANNs), inspired by the brain's structure, are a key to making technology more human-like and aiding machines in processing information the way humans do. A Random Forest (RF) is a supervised machine learning approach for solving classification problems; it employs ensemble learning, a method of resolving complex issues by integrating several classifiers. Both RF and Neural Networks are machine learning approaches mainly used for classification problems, and they are the two learners combined in this study's ensemble. In sum, the experimental analysis aims to answer the following research questions:
* Using a machine learning model, how can an ensemble learning method be applied to enhance the prediction accuracy of source code classifiers?
* What are the existing algorithms used to identify the semantics of the source code function?
* What is the source code feature extraction method?
* Which configuration of classification algorithms shows promising results for source code classification?
The current topic modeling algorithms, such as Latent Dirichlet Allocation (LDA) (Reddivari and Khan, 2019), do not extract topics from source code syntax. Because of this constraint, there is a wide range of challenges in relying on their results. Our suggested approach predicts semantic topic classifications from source code by going through the complete source code word by word.
The primary purpose of this study is to use ensemble algorithms to forecast source code functionality names and then compare those techniques to identify the optimal model. To accomplish this, we initially gathered Java software projects of three functionalities, namely, "Prime Number", "Fibonacci Number", and "Selection Sort", from the Git open-source repository. Then we extracted programming language keyword features using the Java parser library. Finally, using a new algorithm, we converted the source code data to numerical data to feed into our ensemble model.
The following is how the rest of the paper is organized: In Section 2, relevant pieces of literature are discussed. Section 3 describes the proposed methodology; in Section 4, the study results are presented and discussed; finally, conclusion and future scope are presented in Section 5.
2. Literature Review
Multiple studies on topic modeling in software development have looked for ways to optimize topic modeling methodologies (Zhao et al., 2020) in various software engineering activities. Statistical learning techniques are used in most of these cases to find combinations of settings that attain acceptable effectiveness in certain software engineering activities. For aspect-oriented programming, Baldi et al. (2008) introduced a novel theory of aspects based on topic modeling. According to this approach, source code concerns are latent topics that may be retrieved automatically utilizing statistical topic modeling approaches. The authors were able to discover subjects that arose as general-purpose elements across numerous projects, as well as project-specific concerns, using LDA. To trace specifications to their solutions, Oliveto et al. (2010) employed LDA as a retrieval mechanism. The authors compared the effectiveness of LDA with the Vector Space Model (VSM), Jensen-Shannon Model (JSM), and Latent Semantic Indexing (LSI), which are all often utilized in traceability studies.
To index and evaluate source code texts as a combination of probabilistic subjects, Kuhn et al. (2007) used the LDA approach. Using this method, open-source software products are automatically categorized. The efficiency of LDA-based representation in text classification was investigated by Tian et al. (2009). When LDA is compared to various feature selection approaches, it outperforms them. The TF-IDF technique weighs the words in the comparative assessment, and support vector machines are used as the base learners. Ramage et al. (2009) proposed a labeled LDA model to learn word tag correlations that integrate labels and subject priors. The effectiveness of the vector space, LSI, and LDA methods on text categorization were empirically assessed by Liu et al. (2011).
TopicXP, an Eclipse plug-in for retrieving, analyzing, and displaying unstructured data in source code identifiers and comments, was presented by Tian et al. (2014). TopicXP is designed to assist developers performing maintenance activities in better understanding the system and locating relevant results. Gethers et al. (2011) demonstrated CodeTopics, an Eclipse plug-in for capturing and visualizing the connection between source code and other software artefacts like specifications and design papers. Liu et al. (2020) compared LSI and LDA with the Generative Vector Space Model (GVSM), a machine learning technique, to find the best performing algorithm for tracing documentation to source code in bilingual projects.
Topic modeling approaches are primarily employed in source code assessment to reduce the complicated text data of programming languages to more coarse-grained, presumably easier to understand, composite representations of topics. Maskeri et al. (2008), for example, proposed a human-assisted strategy based on LDA for extracting business domain themes from source code. The primary goal is to assist novices in understanding how massive legacy software applications work. Tian et al. (2014) used topic modeling to automatically classify software applications in open-source communities. The authors used LDA to retrieve topics across multiple programming languages. These subjects were grouped into relevant groups based on word distributions, which reflect different sorts of software. The original LDA model has known flaws: the Dirichlet topic distribution is incapable of capturing correlations between topics, words are assumed to be interchangeable, and sentence structure is not represented. Many of these flaws were resolved in follow-up publications to the original LDA study.
According to Kuhn et al. (2007), semantic clustering is a method for identifying linguistic topics in source code that employs Average Linkage (AL) and LSI. The suggested approach uses linguistic data present in source code to group software objects with similar terminology. Sridhara et al. (2008) investigated how six alternative similarity metrics performed when predicting the semantic relations among individual code phrases in a software application. These measurements include information-content, gloss-based, and path-based approaches that use WordNet topologies to build similarity links between words. When these approaches were tested against a collection of manually determined, semantically related codeword pairs, it was shown that leveraging alternate forms of linguistic knowledge in the area of source code might produce unpredictable outcomes.
Tzerpos and Holt (2000) introduced the Algorithm for Comprehension-Driven Clustering (ACDC), which uses various patterns in programming languages to discover clusters in the software, including the file system, processes and variables in source code, and class body header linkages. ACDC created meaningful divisions acceptable for program understanding, compared with manually generated authoritative decompositions.
Psarras et al. (2019) used a vectorizer to extract themes from function names, method names, and comments, and then used LDA to aggregate software source code into separate topics. Project libraries include most of the software components that are reused, and libraries that implement algorithms and frameworks enhance the development process. These methods necessitate a prior grasp of topic modeling methodology on the programmer's part and do not allow for modifying the group of topics, so the derived topic may or may not accurately reflect the meaning or functionality of the source code. According to the findings, non-formal code elements such as comments and identifier names can be just as helpful as structural information in classifying software. However, there are many setbacks in these approaches' results: comments, identifier names, and other non-formal code elements are not a standard in programming; rather, they are merely a best practice.
A developer can write any comment or identifier name they wish. Therefore, topic extraction that relies on such non-formal code cannot be guaranteed to be correct. Our approach differs from these studies in this aspect: we classify the source code from the functionality algorithm itself, and non-formal elements are discarded in our research. Table 1 provides a brief comparison of existing studies concerned with this research.
3. Methodology
This paper proposes an approach to address the flaws mentioned above. We use the Git repository to get the source code of Java-based programs that have three semantic functions: prime numbers, Fibonacci numbers, and selection sort (part A). In the preprocessing phase, we parse the source code using the Java parser library, then produce and refine the necessary source code features for each file (part B) and group the appropriate feature counts. We propose employing an ensemble technique to categorize Java source code by generating semantic features from programming language concepts and using the ANN and RF algorithms to acquire the best classifications. The general methodology used to create the final prediction model is depicted in Figure 1. Below is a detailed description of each phase of the implementation stage.
3.1 Dataset Description
The dataset for the ensemble model was taken from the Git repository. Since it is an open-source platform, developers worldwide contribute to it, so the dataset will not be biased. Since we are working only on Java projects in this study, all non-Java programming language projects were omitted, and the data was generated from Java source code. The data comprises several types of source code corresponding to three program functionalities built by a variety of developers, namely, "Prime Numbers", "Fibonacci Numbers", and "Selection Sort". The rationale for picking these algorithms is that they all have similar programming language syntax and semantics. The dataset utilized in this study contains 450 records and 23 attributes. Table 2 provides a detailed description of the attributes and their data types.
Since we study and examine the complete source code, the data we acquire for our study is qualitative. The Java parser gives an abstract syntax view of the source code from which to analyze the necessary aspects. Qualitative data is often, but not always, non-numerical and is described as soft data; this does not diminish the need for analytical rigor, and we must still perform a thorough examination of the information acquired. To feed our machine learning algorithm, we transformed the qualitative data into quantitative data (e.g., occurrences of operators such as multiplication and addition are tallied to determine how many times they appear in a given source code file).
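As a hedged sketch of this qualitative-to-quantitative conversion (again using javalang as a Python stand-in for the JavaParser library), operator occurrences can be tallied as follows:

```python
import javalang
from collections import Counter

def tally_operators(java_source: str) -> Counter:
    """Count how often each binary operator (+, *, %, <, ==, ...) occurs."""
    tree = javalang.parse.parse(java_source)
    counts = Counter()
    for _path, node in tree.filter(javalang.tree.BinaryOperation):
        counts[node.operator] += 1  # e.g. counts["%"] for the modulo operator
    return counts

# Example: the isPrime snippet from earlier yields counts such as
# {"<": 1, "*": 1, "<=": 1, "%": 1, "==": 1}.
```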
3.2 Preprocessing
Preprocessing is the most critical stage in developing a source code classification model. This work was completed using the Java parser library, which creates an Abstract Syntax Tree of the source code. The following tasks were accomplished in the preprocessing stage, as shown in Figure 2.
3.2.1 Source Code Extraction
To construct count matrices for the models, it is crucial to discover relevant programming logic-related phrases in the source code. The semantics of a source code file is organized, unlike a basic text document, and it is not realistic to presume that each phrase in the code is logically related. Programming language keywords (if, for, foreach, while, Boolean operators, etc.), array constructs, and other programming language syntax make up a significant portion of the characters in software program files. Because the source code projects acquired from Git take different forms and have many dependencies, we must extract the critical source code functionality attributes and filter the data. The extractor uses the GitHub project's source code as input. All non-Java files are eliminated to focus only on Java projects, and Java files are processed using the Java Parser Library, which generates the Abstract Syntax Tree for each class.
3.2.2 Feature Selection
A source code file contains much unstructured, raw data, which must be refined to select the relevant source code features to include in our study. In this stage, we extracted the programming language keywords from the Java project source code of three functionalities, namely, "Prime Numbers", "Fibonacci Numbers", and "Selection Sort", using the Java parser library. The Java parser provides an Abstract Syntax Tree of the source code from which we can extract the source code syntax node by node; each node of the tree represents a construct found in the original code. We decompose the source code data into smaller bits of programming language phrases using the Abstract Syntax Tree. Programming language keywords are an essential element of a source code algorithm's functionality. The most informative representation of source code functionality may be found by examining the source code's programming language keywords, such as statement types, expressions, arrays, Boolean operators, and other specialized constructs.
Other non-formal code elements, such as comments, identifier names, package names, and Javadocs, are discarded. This is the novelty of our study: the non-formal code elements are treated as irrelevant and removed from the study. The words in each class go through a series of adjustments. Using the Java parser method, we extracted only the necessary feature counts by eliminating printing statements, package names, import statements, implementation details, and other non-relevant parts of a source code file. We checked our dataset thoroughly before implementing the model by filling in missing values and omitting redundant records, since missing and duplicated variables may cause inconsistencies in the final classification model. Missing values were substituted with zero, because each cell shows the number of times a feature is present in a code section (for example, if a code section has no if statements, the corresponding cell would be null, so 0 was substituted in that cell).
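A minimal sketch of this per-file feature counting follows, again with javalang standing in for the JavaParser library. The feature names below are a hypothetical subset; the actual 23 attributes are listed in Table 2.

```python
import javalang

# Hypothetical subset of the keyword features; the full 23 attributes
# are those listed in Table 2 of the paper.
FEATURES = {
    "if_count": javalang.tree.IfStatement,
    "for_count": javalang.tree.ForStatement,
    "while_count": javalang.tree.WhileStatement,
    "return_count": javalang.tree.ReturnStatement,
    "method_call_count": javalang.tree.MethodInvocation,
    "array_count": javalang.tree.ArrayCreator,
}

def feature_vector(java_source: str) -> dict:
    """Count each selected construct; a feature that never occurs is 0."""
    tree = javalang.parse.parse(java_source)
    return {name: sum(1 for _ in tree.filter(node_type))
            for name, node_type in FEATURES.items()}
```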
3.2.3 Clustering
After extracting the programming language keywords, we cluster the source code attributes. We constructed an algorithm, using the Java Parser Library, to cluster the source code features and count the number of occurrences of each feature. The source code features are clustered based on programming language keyword usage (for example, the if statements used in a source code file form cluster one; the for loops form cluster two; and similarly for all the selected features shown in Table 2). Table 2 also shows how the attributes and their data types are converted from string to integer. Before we apply the machine learning models to the source code, we must first turn it into a collection of numerical data. The words in each source code file go through a series of modifications before being converted into a CSV file to be utilized in the prediction model.
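As a sketch of this final assembly step (directory layout, label names, and the output file name are illustrative assumptions, not the paper's exact setup), the per-file vectors can be collected into a count matrix and written to CSV:

```python
import pathlib
import pandas as pd

rows = []
# Hypothetical layout: one directory of .java files per functionality label.
for label in ["prime_number", "fibonacci_number", "selection_sort"]:
    for java_file in pathlib.Path("projects", label).rglob("*.java"):
        vec = feature_vector(java_file.read_text())  # from the sketch above
        vec["functionality"] = label                  # target class
        rows.append(vec)

dataset = pd.DataFrame(rows).fillna(0)  # any residual missing counts -> 0
dataset.to_csv("source_code_features.csv", index=False)
```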
3.3 Model Implementation
We built three supervised learning models, namely, ANN, RF, and a combination of ANN and RF as an Ensemble Approach, using the Python programming language. We first built and analyzed the ANN and RF models individually, and then built the Ensemble Approach and compared it against them. The following steps were accomplished as part of the model implementation procedure.
* Initially, the PyCharm IDE was used to create a Python project. The machine learning requirements and third-party libraries were loaded into the project directory. The acquired dataset was then saved in the project folder.
* The dataset from part A was read using the "Pandas" library.
* Then, we examined the dataset's structure and the relevant columns and rows.
* After the data refining and interpretation stages were finished, the dataset was divided into training and testing sets (70% and 30%, respectively).
The dataset with all characteristics was put into the supervised learning algorithms, ANN, RF, and the Ensemble Approach, to find the best classifications based on the underlying evaluation measures. The trained classifiers were tested with the testing data, and the models with the best results were picked based on accuracy, precision, recall, F1-score, and MAE. We effectively evaluate and contrast the current and previous techniques using these basic assessment metrics.
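A condensed sketch of the data loading and 70/30 split described above, assuming the CSV produced earlier and scikit-learn (file and column names are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative file/column names; the paper's CSV schema follows Table 2.
data = pd.read_csv("source_code_features.csv")
X = data.drop(columns=["functionality"])  # the 22 count features
y = data["functionality"]                 # the target class

# 70% training / 30% testing, as described in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
```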
3.4 Machine Learning Classification Models
3.4.1 Artificial Neural Network
ANNs are a form of artificial intelligence that mimics the way the human brain processes a series of stimuli to generate an output. An ANN comprises three kinds of layers: an input layer, hidden layers, and an output layer. The components of an ANN are neurons, which are best characterized as a weighted directed graph, as shown in Figure 3. Directed edges with weighted values represent the relationships between input and output neurons.
The ANN computes the weighted total of the inputs plus a bias. A transfer (activation) function expresses this calculation, as in Equation (1):
$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right) \quad (1)$$

where $x_i$ are the inputs, $w_i$ the corresponding weights, $b$ the bias, and $f$ the activation function.
The weighted total is utilized as an input to an activation function to construct the output. Activation functions determine whether or not a node should fire; only the nodes that fire pass their signal on to the output layer. Several activation functions can be used depending on the task at hand.
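A tiny numeric illustration of Equation (1), with a sigmoid activation assumed for concreteness (the paper does not state which activation function it uses):

```python
import numpy as np

def neuron_output(x, w, b):
    """Equation (1): weighted sum of inputs plus bias, passed through
    an activation function (sigmoid assumed here for illustration)."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid activation

# Example: three inputs, arbitrary weights and bias.
x = np.array([2.0, 0.0, 1.0])
w = np.array([0.5, -0.4, 0.3])
b = 0.1
print(neuron_output(x, w, b))  # z = 1.4, output ~0.80 -> the node fires
```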
3.4.2 Random Forest
An RF is a supervised learning approach for solving classification and regression problems that builds an ensemble of decision trees. It employs ensemble techniques to resolve complex problems by integrating many classifiers. RF models are learned using bootstrap aggregation (bagging), a meta-technique that combines machine learning approaches to improve accuracy. The RF model determines the outcome based on the predictions of its decision trees, combining the results of several tree models. As the number of trees increases, the accuracy of the output generally improves.
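A hedged scikit-learn sketch of the RF classifier on the split from earlier (hyperparameters are illustrative; the paper does not report its settings):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# n_estimators is illustrative; more trees generally improve accuracy.
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)                      # bagged decision trees
print(accuracy_score(y_test, rf.predict(X_test)))
```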
3.4.3 Ensemble Approach (ANN and RF)
The ensemble employed in this study combines two different learning methodologies, ANN and RF, both of which are machine learning approaches. Combining the two models gives us an Ensemble Approach with better results than either model alone. One of the most straightforward ensemble approaches is voting; it is simple to comprehend and implement, and it is commonly used for classification problems. In this study, we utilized the voting ensemble approach to predict the source code functionality.
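A minimal sketch of such a voting ensemble with scikit-learn. The architecture and hyperparameters are illustrative, and soft voting is assumed; the paper states a voting ensemble of ANN and RF is used but not whether the voting is hard or soft.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier

# Illustrative configurations, not the paper's exact setup.
ann = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42)

# Soft voting averages the predicted class probabilities of both models.
ensemble = VotingClassifier(estimators=[("ann", ann), ("rf", rf)],
                            voting="soft")
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```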
4. Results and Discussion
4.1 Evaluation
The efficiency of our study's machine learning classification algorithms (ANN, RF, and Ensemble Approach) is measured using evaluation metrics. Accuracy (Equation (2)), Precision (Equation (3)), Recall (Equation (4)), and F1-score (Equation (5)) were used for the initial evaluation:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (2)$$

$$\text{Precision} = \frac{TP}{TP + FP} \quad (3)$$

$$\text{Recall} = \frac{TP}{TP + FN} \quad (4)$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (5)$$
where
TP = Correctly predicted positive semantic topic
TN = Correctly predicted negative semantic topic
FP = Incorrectly predicted positive semantic topic
FN = Incorrectly predicted negative semantic topic
Evaluation generally requires training a model on the training portion of the dataset, using the model to generate forecasts on the testing data, and comparing the predictions with the expected values of the held-out data. We used the same dataset for every model so we could compare and evaluate which model was most efficient for our research.
To assess the efficiency of the suggested classification algorithms, ANN, RF, and Ensemble Approach, we calculated precision, recall, and F1-score. The findings of the evaluation metrics for the ANN, RF, and Ensemble Approach models are shown in Tables 3, 4 and 5, respectively.
Precision is the fraction of correctly predicted positive observations among all observations predicted positive; it determines how many of the predicted targets behave as predicted, and high precision is linked to a low false-positive rate. Recall is the fraction of correctly predicted positive observations among all observations actually belonging to the class. The F1 measure is the harmonic mean of recall and precision; as a result, this score considers both false positives and false negatives. Accuracy works best when the costs of false positives and false negatives are roughly equal.
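Continuing the sketch above, Equations (2)-(5) can be computed with scikit-learn. Macro averaging across the three classes is assumed here; the paper does not state its averaging scheme.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_pred = ensemble.predict(X_test)

# Equations (2)-(5); macro averaging weights the three classes equally.
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="macro"))
print("Recall   :", recall_score(y_test, y_pred, average="macro"))
print("F1-score :", f1_score(y_test, y_pred, average="macro"))
```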
A machine learning model is also evaluated using the Mean Absolute Error (MAE) metric. This figure indicates how accurate our predictions are and how far they deviate from the actual values. In this situation, errors are the absolute differences between the values predicted by our three models (ANN, RF, and Ensemble Approach) and the actual values. For an optimal solution, the MAE would be zero; in practice, MAE rarely reaches zero in real-world scenarios, so we instead need to make sure the MAE value is close to zero. The MAE value is computed through Equation (6):
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| x_i - x_{p,i} \right| \quad (6)$$

where $x_i$ is an actual value, $x_{p,i}$ is the corresponding predicted value, and $n$ is the number of observations.
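Equation (6) requires numeric values, so in this sketch the class labels are integer-encoded first; how the paper maps classes to numbers is not stated, and this encoding is an assumption.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# MAE needs numeric values, so class labels are integer-encoded here;
# the paper does not state its class-to-number mapping.
enc = LabelEncoder().fit(y)
actual = enc.transform(y_test)
predicted = enc.transform(ensemble.predict(X_test))

mae = np.mean(np.abs(actual - predicted))  # Equation (6)
print("MAE:", mae)
```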
4.2 Performance Comparison
The proposed solution is based on a performance evaluation of several models for predicting the functionality name of source code. This comparison provides a consistent basis for a practical machine learning solution, because the target classes are closely associated variables that have been critically assessed in this extensive research.
The most basic performance statistic is accuracy, simply the percentage of correctly predicted observations among all observations. Our models achieved an accuracy of 95.4%, 94.8%, and 96.7% for ANN, RF, and the Ensemble Approach, respectively, as shown in Figure 4, implying that RF has lower accuracy than ANN and the Ensemble Approach. As a consequence of its superior accuracy, recall, and F1-score, the Ensemble Approach surpassed the other two algorithms and showed the best accuracy rate. Figure 4 shows a summary comparison of our three models.
The MAE rates of ANN are significantly lower than those of the other two models, as seen in Table 6. MAE should be 0 for a perfect model, but obtaining such a result for any realistic solution is practically impossible. In this study, all three machine learning models have error rates close to 0.0, as shown in Figure 5, which demonstrates the efficiency of our models in terms of low error. The MAE rates of our Ensemble Approach and RF are also relatively close to zero, even though the ANN model surpasses both with the lowest error percentage, an effective error rate for a classification algorithm.
Neural Networks and RF are very effective supervised learning algorithms used in many prediction models. Even though both ANN and RF give exceptional accuracy results, we wanted to enhance our final prediction model by combining ANN and RF to achieve higher accuracy. With 96.7% accuracy, we find that the Ensemble Approach surpasses ANN and RF in accuracy while keeping a comparably low MAE. Our recommended Ensemble Approach delivered the most remarkable outcomes across all performance indicators. Even though these study results are based on three source code functionalities, the main idea was to test the power of programming language keywords to predict source code functionality. As the findings show an exceptional accuracy rate, we can conclude that programming language keywords can be utilized to predict source code functionality. This novel approach predicts source code functionality names without using statistical topic modeling approaches, such as LDA, LSI, and LSA. More importantly, the feature extraction of source code elements is based on programming language keywords, and non-formal code elements are discarded because of reliability issues. The source code algorithm gives efficient and reliable output results for source code functionality name prediction. Therefore, the prediction model built through the Ensemble Approach provides reliable topic modeling of source code functionality.
5. Conclusion
In this work, we develop three machine learning models to evaluate, combine, and identify a source code's semantic functionality. We use the Java parser tool to decompose the source code and retrieve its features, and we devised an algorithm to calculate how often each programming language keyword was used before exporting the data to a CSV file. When the dataset was fed into the ensemble model, the accuracy of our topic modeling method was remarkable. Although several programming syntaxes exist for the same functionality, our investigation included a comparison of several literature studies. We considered programming language keywords, expressions, Boolean operators, arrays, statements, and the number of method calls. This study evaluated a dataset comprising 450 rows of data and 23 attributes. The machine learning algorithms, viz., ANN, RF, and Ensemble Approach (ANN+RF), were exhaustively examined for maximum classification accuracy throughout the implementation. According to the data, the Ensemble Approach surpassed all other algorithms with a classification accuracy of 96.7% and a low error rate (MAE) of 0.053.
Since this is a new approach to source code topic modeling, there are many aspects to consider in future works. It is important to consider other programming paradigms and object-oriented programming relationships in future studies. We used an ensemble strategy to build our classification model and forecast accuracy; however, there are many more deep neural approaches and machine learning algorithms that could be evaluated to produce more accurate conclusions in future studies. Even though this study focuses on Java projects, the same strategy can be applied to projects in other programming languages by utilizing an appropriate parsing method. We constructed this model utilizing only three semantic source code functions and investigated around 450 records as part of our research endeavor. We want to train our system with more data and more semantic functions in future studies.
* Department of Computing and Information Systems, Faculty of Applied Sciences, Sabaragamuwa University of Sri Lanka, P.O. Box 02, Belihuloya, Sri Lanka. E-mail: [email protected]
** Department of Computing and Information Systems, Faculty of Applied Sciences, Sabaragamuwa University of Sri Lanka, P.O. Box 02, Belihuloya, Sri Lanka. E-mail: [email protected]
*** Department of Computing and Information Systems, Faculty of Applied Sciences, Sabaragamuwa University of Sri Lanka, P.O. Box 02, Belihuloya, Sri Lanka; corresponding author. E-mail: [email protected]
References
1. Baldi P F, Lopes C V, Linstead E J and Bajracharya S K (2008), "A Theory of Aspects as Latent Topics", ACM Sigplan Notices, Vol. 43, No. 10, pp. 543-562.
2. Banker R D, Kauffman R J and Zweig D (1993), "Repository Evaluation of Software Reuse", IEEE Transactions on Software Engineering, Vol. 19, No. 4, pp. 379-389.
3. Bi T, Liang P, Tang A and Yang C (2018), "A Systematic Mapping Study on Text Analysis Techniques in Software Architecture", Journal of Systems and Software, Vol. 144, No. 10, pp. 533-558.
4. Corley C S, Kammer E A and Kraft N A (2012), "Modeling the Ownership of Source Code Topics", Paper Presented at the 20th IEEE International Conference on Program Comprehension (ICPC).
5. Gethers M, Savage T, Di Penta M et al. (2011), "Codetopics: Which Topic Am I Coding Now?", Paper Presented at the Proceedings of the 33rd International Conference on Software Engineering.
6. Haiduc S, Aponte J, Moreno L and Marcus A (2010), "On the Use of Automated Text Summarization Techniques for Summarizing Source Code", Paper Presented at the 17th Working Conference on Reverse Engineering.
7. Kuhn A, Ducasse S and Gîrba T (2007), "Semantic Clustering: Identifying Topics in Source Code", Information and Software Technology, Vol. 49, No. 3, pp. 230-243.
8. Lakhotia A (1993), "Understanding Someone Else's Code: Analysis of Experiences", J. Syst. Softw., Vol. 23, No. 3, pp. 269-275.
9. Liu Z, Li M, Liu Y and Ponraj M (2011), "Performance Evaluation of Latent Dirichlet Allocation in Text Mining", Paper Presented at the Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD).
10. Liu Y, Lin J and Cleland-Huang J (2020), "Traceability Support for Multi-lingual Software Projects", Paper Presented at the Proceedings of the 17th International Conference on Mining Software Repositories.
11. Mahmoud A and Bradshaw G (2017), "Semantic Topic Models for Source Code Analysis", Empirical Software Engineering, Vol. 22, No. 4, pp. 1965-2000.
12. Maskeri G, Sarkar S and Heafield K (2008), "Mining Business Topics in Source Code Using Latent Dirichlet Allocation", Paper Presented at the Proceedings of the 1st India Software Engineering Conference.
13. Nguyen H A, Nguyen A T and Nguyen T N (2016), "Using Topic Model to Suggest Fine-grained Source Code Changes", Paper Presented at the IEEE International Conference on Software Maintenance and Evolution (ICSME).
14. Oliveto R, Gethers M, Poshyvanyk D and De Lucia A (2010), "On the Equivalence of Information Retrieval Methods for Automated Traceability Link Recovery", Paper Presented at the IEEE 18th International Conference on Program Comprehension.
15. Poursabzi-Sangdeh F, Goldstein D G, Hofman J M et al. (2021), "Manipulating and Measuring Model Interpretability", Paper Presented at the Proceedings of the CHI Conference on Human Factors in Computing Systems.
16. Psarras C, Diamantopoulos T and Symeonidis A (2019), "A Mechanism for Automatically Summarising Software Functionality From Source Code", Paper Presented at the IEEE 19th International Conference on Software Quality, Reliability and Security (QRS).
17. Ramage D, Hall D, Nallapati R and Manning C D (2009), "Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-labeled Corpora", Paper Presented at the Proceedings of the Conference on Empirical Methods in Natural Language Processing.
18. Reddivari S and Khan M S (2019), "VisioTM: A Tool for Visualizing Source Code Based on Topic Modeling", Paper Presented at the IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC).
19. Singh P, Singh S and Kaur J (2013), "Tool for Generating Code Metrics for C# Source Code Using Abstract Syntax Tree Technique", ACM SIGSOFT Software Engineering Notes, Vol. 38, No. 5, pp. 1-6.
20. Sridhara G, Hill E, Pollock L et al. (2008), "Identifying Word Relations in Software: A Comparative Study of Semantic Similarity Tools", Paper Presented at the 16th IEEE International Conference on Program Comprehension.
21. Tian K, Revelle M and Poshyvanyk D (2009), "Using Latent Dirichlet Allocation for Automatic Categorisation of Software", Paper Presented at the 6th IEEE International Working Conference on Mining Software Repositories.
22. Tian Y, Lo D and Lawall J (2014), "Automated Construction of a Software-Specific Word Similarity Database", Paper Presented at the Software Evolution Week-IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE).
23. Tzerpos V and Holt R C (2000), "ACDC: An Algorithm for Comprehension-Driven Clustering", Paper Presented at the Proceedings Seventh Working Conference on Reverse Engineering.
24. Zhang W E, Sheng Q Z, Abebe E et al. (2016), "Mining Source Code Topics Through Topic Model and Words Embedding", Paper Presented at the International Conference on Advanced Data Mining and Applications.
25. Zhao N, Chen J, Wang Z et al. (2020), "Real-Time Incident Prediction for Online Service Systems", Paper Presented at the Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering.