This paper addresses the problem of named entity recognition in source code reviews. It provides a comparative analysis of existing approaches and proposes methods to improve the quality of the solution. The proposed and implemented improvements include methods to deal with data imbalance, improved tokenization of the input data, the use of large amounts of unlabeled data, and the use of additional binary classifiers. To assess quality, a new set of 3000 user code reviews was collected and manually labeled. It is shown that the proposed improvements significantly increase performance as measured by quality metrics calculated both at the token level (+22%) and at the level of entire entities (+13%).
INTRODUCTION
Named entity recognition (NER) is an important task in the field of natural language processing (NLP). This is a complex task consisting of two parts:
(1) determine whether a word or phrase is a named entity;
(2) determine the class to which this entity belongs.
The set of classes is not fixed and depends on the formulation of a specific task: personal names and the names of cities, companies, or organizations are often the targets. However, NER can also be used in more specialized areas, for example, reviews and comments on source code (source code review). In this case, the set of classes is domain-specific, and words associated with the source code under review are of greater interest. Recognizing such words helps to better understand the meaning of the review, which, in turn, can be used in classification tasks, review clustering, semantic comparison, and search.
When solving the problem of extracting named entities, it is necessary to break the text into a sequence of semantic units, tokens, which are then classified. Typically, tokens are individual words and punctuation marks. Thus, a named entity can consist of multiple tokens, just as an organization name can contain multiple words. Tokenization, the process of dividing text into tokens, can be more complex than producing a sequence of words separated by whitespace, and it depends on the domain to which the text belongs. For source code fragments, the rules for dividing text into semantic units may depend on the syntax of the programming language.
There are many different approaches to solving the NER problem. The simplest approaches are those based on the use of dictionaries, or checking the text for the presence of any patterns. It is possible to use statistical models for sequence classification, such as hidden Markov models or conditional random fields. Since NER is a classification problem, support vector machines can be used. But the best results are obtained by approaches using neural networks. This is mainly due to the ability of neural networks (e.g., recurrent or transformers) to make efficient use of context. In addition, the use of word vector representation models allows us to store more information about tokens in the text.
The basis of the implemented classifiers is the BERT family of models [1] (bidirectional encoder representations from transformers), pretrained language models based on the Transformer architecture. Because of its architecture, BERT is able to capture contextual information. This model can be used to solve many different natural language processing problems, including named entity recognition. In this work, BERT family models were used to train a token classifier on a new dataset.
One of the related works in this direction is SoftNER [2], in which the authors researched and developed a classifier of named entities for messages from the StackOverflow1 forum. The main results of the work include the use of intermediate vector representations of a token from different models as parts of a single classifier. This approach, as the best one presented, was taken as the basis for further research.
The effectiveness of all implemented methods was assessed on the new data set, and an error analysis of the classifiers was carried out.
The main contributions of this work are:
• a set of 3000 source code reviews collected and labeled with 15 entity classes;
• proposed methods for improving the performance of classifiers;
• trained classifiers that recognize 31 types of entities (in BIO notation) related to software development.
METHODS FOR RETRIEVING NAMED ENTITIES
The task of token classification in text is quite common and well-studied. Over time, numerous approaches have been proposed to solve this problem. Various features have been used to describe words, such as: form, lemma, domain of surrounding words in a sliding window, reference information, statistical data, use of capital letters, punctuation, and so on. The use of artificial neural networks not only expanded the list of numerical word and part-of-speech features but also improved the quality of word determination and classification in text [3].
Hidden Markov Models
In work [4], hidden Markov models were used for recognizing and classifying names, dates, times, and numerical values in MUC-6, MUC-7, and news broadcast datasets. The developed model, when trained on a dataset of 100 000 words, learned to classify with an accuracy of 94%. Here, the NER task is considered as the task of assigning each word one of the proposed labels or a special label NOT-A-NAME to indicate that the word does not belong to any of the labels. Then, a probabilistic model is used to calculate the likelihood of words appearing in the context of bigrams.
In a more formal formulation, the task consists of finding the sequence of name classes (NC) that is most likely for a given sequence of words (W), i.e., maximizing P(NC|W).
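For clarity, this can be restated using Bayes' rule (a standard decomposition for generative sequence models such as the HMM in [4], shown here as a reminder rather than taken from the original formulation):

NC* = argmax_NC P(NC|W) = argmax_NC P(W|NC)·P(NC),

since P(W) does not depend on the chosen sequence of name classes.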
Conditional Random Fields
Conditional random fields (CRF) were introduced by the authors of work [5] as a tool for statistical modeling of pattern recognition tasks. In work [6], a variant of using CRF for named entity extraction was proposed, in which the conditional probability of a label sequence is modeled as

P(s|o) = (1/Z_o) exp(Σ_t Σ_k β_k f_k(s_{t−1}, s_t, o, t)),

where o = 〈o1, o2, …, oT〉 is the input sequence (words), s = 〈s1, s2, …, sT〉 is the sequence of states (corresponding to the input sequence labels), Z_o is a normalizing coefficient, f_k(s_{t−1}, s_t, o, t) is an arbitrary feature function, and β_k is a trainable feature weight. Transition values between two states can be calculated using dynamic programming algorithms. In the work, the proposed solution achieved an F1-score of 84.04% for English and 68.11% for German at the CoNLL 2003 competition.
Support Vector Machine
The support vector machine (SVM) method was presented by Cortes and Vapnik [7]. The algorithm is based on the idea of a separating linear hyperplane that maximizes the distance from the plane to the nearest points of both classes. In work [8], the application of the SVM method to the named entity extraction task is described. The essence is to create 8 classifiers, one for each class. The feature vector consists of more than 1200 binary features for each word. When describing each word, its context of size 7 (3 words before and 3 after) is also used. Given the outputs of the 8 models, the label for a word is determined from the models' confidence scores. If none of them is confident that the word belongs to its corresponding class, the label "O" is assigned.
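The decision rule described above can be sketched as follows; this is an illustrative reconstruction rather than the code of [8], and it assumes scikit-learn-style binary SVMs that expose a decision_function confidence score.

```python
import numpy as np

def assign_label(word_features, classifiers, class_names, threshold=0.0):
    """Pick a label from several one-vs-rest SVMs by confidence.

    classifiers: binary models with a decision_function() method
                 (e.g., sklearn.svm.SVC), one per entity class.
    Returns the most confident class, or "O" if no model is confident.
    """
    scores = np.array([clf.decision_function([word_features])[0]
                       for clf in classifiers])
    if scores.max() <= threshold:   # none of the models is confident enough
        return "O"
    return class_names[int(scores.argmax())]
```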
Neural Networks
With the spread of machine learning and deep neural network methods for natural language processing tasks, they have also begun to be applied to the NER task. Since recurrent neural networks work with sequential data, they are well suited for natural language processing. Moreover, the long short-term memory (LSTM) networks developed later are capable of “remembering” long dependencies in text, improving the overall understanding of the text. In work [9], it was shown that BiLSTM-CRF-based solutions demonstrated the best results on the studied datasets (in the CoNLL 2003 set for English, 90.94% F1-score; for German, 78.76% F1-score; CoNLL 2002 for Spanish, 86.75% F1-score). The use of artificial neural networks in the NER task is not limited to the traditional domain of detecting entities of classes ORG, PERSON, DATETIME, but is also used in the medical field [10] for finding disease names, symptoms, and pharmaceutical preparations.
Transformer-based models, which have been actively used in many natural language processing tasks since their inception, are based on the attention mechanism [11]. This allows for capturing dependencies between distant words in a sentence. It turned out that for named entity extraction in most areas, models based on this architecture outperform others [12]. In work [13], the application of BERT models for solving the named entity extraction task from texts on cybersecurity in Russian is described. The authors propose a data augmentation method to obtain a larger number of named entity examples. One of the ways to improve the classifier’s performance is to train a large language model on a specific set of texts.
The closest task to ours is the classification of named entities related to software in StackOverflow question-answer pairs, considered in the article [2]. The basis of the proposed solution is deep neural networks and supervised learning. Several word vector representation models are considered (ELMo, GloVe, BERT). Two methods for improving the quality of classification are also proposed: pre-training the model that encodes words into vectors; using additional models that provide vector representations of words. The best configuration among those considered was the use of a pretrained BERT model (BERTOverflow) trained on a large set of unlabeled StackOverflow messages and the use of intermediate vectors from two models as additional features for the main classifier. The creation of an open manually labeled dataset can also be considered an advantage of the work.
LABELED CODE REVIEW COMMENTS DATASET
In this section, the process of creating the CodeReviewCommentsNER dataset is described, which is publicly available on Zenodo2. There are not many solutions for extracting named entities for code review comments or similar domains [14, 15]. The only available dataset related to software development with detailed labeling is the SoftNER set [2]. A positive aspect is the similar subject area, namely user comments from the StackOverflow forum. An undeniable advantage is the presence of tags related to source code: Class, Variable, Function, Value, Data Type.
The drawbacks of the SoftNER dataset include its tokenization rules.
Also, not all proposed classes are clearly defined, relevant, and frequently used in code review comments. For example, the “IN LINE CODE” class will always overlap with other code-related classes and can be used more as a generalizing class in hierarchical classification.
Moreover, the proposed dataset has an entity distribution that is not similar to that obtained from code review comments. For example, from Table 1, it can be seen that the number of variable names in the proposed dataset is 13 times smaller than in the analyzed code reviews. Thus, due to the difference in the subject area and disagreements in the data labeling, it was necessary to create our own dataset.
Table 1. Percentage of important classes in the dataset
Class | CodeReviewCommentsNER | SoftNER |
|---|---|---|
Variable | 2.7% | 0.2% |
Value | 1.8% | 0.8% |
Function | 1.5% | 0.5% |
External_Tool/Application | 0.3% | 1.1% |
For analysis and labeling, 3000 code review comments were collected from various open-source projects (from 1000 GitHub3 Java, Python, and C/C++ projects, from the Android Open Source Project4 Gerrit system, and from Tizen Gerrit5). Of these, 2567 comments were randomly selected from the total base of 420 000 comments, and 433 comments were selected from the same base by the presence of keywords, in order to increase the number of representatives of some classes.
Classes Description
As the initial set of classes, the 20 classes proposed by the authors of the SoftNER work were considered: CLASS, VARIABLE, IN LINE CODE, FUNCTION, LIBRARY, VALUE, DATA TYPE, HTML XML TAG, APPLICATION, UI ELEMENT, LANGUAGE, DATA STRUCTURE, ALGORITHM, FILE TYPE, FILE NAME, VERSION, DEVICE, OS, WEBSITE, USER NAME. Such a set of labels requires modification for code reviews, as mentioned before.
The meaning of the APPLICATION class was expanded and transformed into External_Tool. The classes UI ELEMENT, LANGUAGE, DATA STRUCTURE, ALGORITHM, VERSION, DEVICE, and USER NAME were removed. The Error_Name class was added.
The final set of 15 classes is presented in Table 2.
Table 2. List of class labels with description
Class name | Description | Examples |
|---|---|---|
Variable | names of variables and objects | logicColumn, Position, mount_point_count, MS_BIND |
Function | names of functions, methods and procedures, without capturing characters “()” | _handle_fromlist, EXPECT_THAT, getInput, ForEach |
Class | names of classes, structures | DisplayVk, VirtualHost, ImagePipeline, ImporterMesh, UniformLocation |
Value | constant strings (including the surrounding “ characters), numbers, Boolean values, the values “null”, “None” | “1px solid grey”, “<classpathentry …>”, "0 0,1,2,3 11 19 AUG ? 2018", 0.7, 30, 1, 2, null |
File_Name | file name with extension, full path to the file/directory | testing/tf.py, wine_data.csv, /data/app/lib/x86/libgame.so |
File_Type | mention of the file extension, without taking into account the full name | GIFs, webp, .jar, binary |
Keyword | keywords used in programming languages | PUBLIC, ifndef, unsigned, class, interface, else, select, yield |
Data_Type | the name of data types, including user types and structures, if used as a type | unsigned char, Float, ConstBytes, TypeName, Vector, LandmarkPoint, UInt64 |
Library_Package | references to loaded libraries/modules | com.google.gerrit.httpd.rpc.change.ListChangesServlet, log4j, mlflow.projects.backend.local._create_virtualenv |
Error_Name | the name of the error class or its name | NoManifestException, IOException, InvalidOperation, ImportError, Unimplemented, Exceptions, NPD, ANR, OOM, out of memory |
HTML_XML_Tag | html/xml tag with symbols <> | <classpathentry kind="output">, <body>, <c1>, </style>, <svg height="20" width="20"> |
Operating_System | names of operating systems | android, Chrome OS, Linux, Unix, SerenityOS, Win 8, bsd, arch, selinux |
Programming_Language | names of programming languages | C++17, xml, cpp, c++, R, pythonic |
External_Tool | names of third-party applications that can be run as a self-contained tool | Qt, travis, alibaba-dubbo, pylint, JVM, dockerd, isoltest |
Website | links to web resources | https://www.example.com/, https://www.example.com/somepage\#heading2, https://example.com/file.png |
The first 100 comments were labeled by three specialists; the average Cohen's kappa agreement coefficient was 0.82, indicating almost complete agreement between the labelers. After analyzing the disagreements on some classes, clarifications and examples of agreed-upon named entities were added to the labeling instructions. Further labeling was done by individual annotators, with partial cross-checking that showed a high agreement coefficient. The main type of disagreement was the omission of named entities, which accounted for less than 0.1% of the total number of tokens. For the final set of labeled data, a re-review of the comments was conducted, and missed labels were corrected.
Review Tokenization
One of the distinguishing features of our data is the specificity of splitting the text into tokens. Splitting by spaces alone is not the correct approach, as source code often contains many meaningful objects separated by various symbols: dots, brackets, etc., and programming language syntax rules also affect the way code is written. Thus, there is a need to create tokenization rules that take into account all these features of the review text without creating an excessive number of tokens.
Using regular expressions, a tokenization method was created that meets these requirements. The main principles are:
• separating alphanumeric characters from non-alphanumeric characters;
• separating quotes from other symbols;
• separating dots and commas from alphanumeric characters;
• separating brackets from nonbracketed symbols.
Because of the relatively small number of rules, it was possible to maintain high tokenization speed and obtain a split that closely matches manual labeling and is suitable for splitting code snippets regardless of the programming language used.
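The pre-tokenization rules listed above can be approximated with a single regular expression. The sketch below is an illustrative approximation of our pre-tokenizer rather than its exact rule set: runs of letters, digits, and underscores stay together, and every other non-space symbol (quotes, dots, commas, brackets, operators) becomes its own token.

```python
import re

# Illustrative approximation of the pre-tokenizer described above.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+|[^\sA-Za-z0-9_]")

def pretokenize(text: str) -> list[str]:
    """Split a review comment into coarse tokens before BERT tokenization."""
    return TOKEN_RE.findall(text)

print(pretokenize('use np.newaxis("x") instead'))
# ['use', 'np', '.', 'newaxis', '(', '"', 'x', '"', ')', 'instead']
```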
The data is passed to the model after this preliminary tokenization. These tokens are then further divided by the BERT tokenizer into subword tokens for which the BERT model can produce vector representations.
Data Characteristics
In total, 3000 reviews contained 94 328 tokens. When dividing into training and test sets, the rule was followed that 10% + 1 example of each class would be included in the test set, in order to avoid a situation where examples of small classes are not in the test set. Some general statistics are given in Table 3.
Table 3. General characteristics of the data set
Characteristic | Total | Train | Test |
|---|---|---|---|
Number of samples | 3000 | 2543 | 457 |
Number of tokens | 94 328 | 80 838 | 13 490 |
Number of tokens with entities | 15420 | 13429 | 1991 |
Mean number of entity tokens per sample | 5.1 | 5.3 | 4.4 |
Number of entities | 8514 | 7241 | 1285 |
Mean number of entities per sample | 2.8 | 2.8 | 2.8 |
The structure by entity type is presented in Table 4. Based on the data in the table, two observations can be made: there are significantly fewer entity tokens than tokens without any entity; and some entity types have relatively few instances, but those instances are long and consist of a large number of tokens (Website, File_Name).
Table 4. Entity data structure
Entity type | Number of tokens | Number of entities |
|---|---|---|
O | 77 079 | 77 079 |
Variable | 5089 | 2525 |
Website | 3669 | 161 |
Function | 3655 | 1429 |
Value | 1548 | 823 |
Class | 1328 | 666 |
File_Name | 1169 | 174 |
HTML_XML_Tag | 1027 | 287 |
Library_Package | 955 | 278 |
Keyword | 848 | 815 |
Data_Type | 741 | 514 |
External_Tool | 347 | 259 |
Error_Name | 272 | 131 |
File_Type | 241 | 154 |
Operating_System | 193 | 160 |
Programming_Language | 172 | 138 |
BASIC SOLUTION
Basic Models
Models from the bidirectional encoder representations from transformer (BERT) family, namely RoBERTa [16] and CodeBERT [17], were chosen as a basic solution. The first one has proven itself well in various natural language processing (NLP) tasks. CodeBERT is a model based on RoBERTa, but trained on source code examples with documentation to build a semantic connection between the source code and a natural language description of this code.
In addition, as noted in the SoftNER work, models pre-trained on texts from the target domain improve the final results. Since the BertOverflow model obtained by its authors is publicly available, it was also used in the experiments.
For pre-training on source code reviews, about 420 thousand unique comments were taken from various open source projects. The training was carried out on the Masked Language Model (MLM) task, in which some of the tokens are replaced with a special mask token and the model is asked to restore the missing tokens.
CodeBERT is a model already pretrained on its own data set; we additionally pretrain it on our own set of review comments, obtaining PretrainCodeBert.
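A minimal sketch of this continued MLM pretraining step using the HuggingFace transformers library is shown below; the file name, sequence length, batch size, and number of epochs are illustrative placeholders, not the exact values used in our runs.

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

# Start from the published CodeBERT checkpoint and continue MLM pretraining
# on the corpus of code review comments (one comment per line; file name is a placeholder).
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base")

comments = Dataset.from_dict({"text": open("review_comments.txt").read().splitlines()})
tokenized = comments.map(lambda b: tokenizer(b["text"], truncation=True, max_length=256),
                         batched=True, remove_columns=["text"])

# 15% of the tokens are masked and must be restored by the model.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pretrain_codebert",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
model.save_pretrained("pretrain_codebert")
```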
Thus, there are 4 basic models with which we will conduct further experiments: roberta-base, codebert-base, BertOverflow, PretrainCodeBert.
For all classifiers using only one BERT-like model, the following architecture was used:
• BERT tokenizer: turns the incoming sequence into model dictionary tokens and their indices, understandable by the BERT encoder;
• BERT encoder: for each input element, returns a vector representation consisting of 768 numbers;
• Classification layers: a set of neural network layers in various configurations that solve the problem of classification into N classes based on the input feature vector.
The standard loss function for multiclass classification problems is CrossEntropy. We used the Adam optimizer with a fixed learning_rate=5e-6.
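A minimal sketch of this architecture is given below; the dropout value and the single linear layer in the classification head are illustrative assumptions, since the exact configuration of the classification layers varied between experiments.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

NUM_LABELS = 31  # 15 entity classes in IOB2 notation plus "O"

class TokenClassifier(nn.Module):
    """BERT encoder followed by classification layers, as described above."""
    def __init__(self, encoder_name="roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Sequential(
            nn.Dropout(0.1),                  # illustrative regularization
            nn.Linear(768, NUM_LABELS),       # 768-dimensional vector per token
        )

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(hidden)              # (batch, seq_len, NUM_LABELS) logits

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = TokenClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-6)
loss_fn = nn.CrossEntropyLoss()               # replaced by focal loss in Section 4.3

batch = tokenizer(["Please rename mount_point_count"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
```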
Additional Token Splitting
The tokenization performed to obtain a vector representation for BERT consists of two stages: pretokenization and tokenization using the pretrained BERT model tokenizer. Without pretokenization, BERT's built-in tokenizer splits long tokens that are absent from the model vocabulary into parts that carry little meaning. The pretokenization stage is needed to obtain more meaningful partitions: using a set of rules, it breaks identifiers written in snake_case or camelCase into separate words, which allows splitting long tokens into their meaningful component parts.
Without pretokenization we get “mAssistants” -> [‘m’, ‘Ass’, ‘ist’, ‘ants’].
With pretokenization: “mAssistants” -> [‘m’, ‘Assistants’] -> [‘m’, ‘Assist’, ‘ants’], so information about the word “Assist” remains.
This difference arises because, when the language model was trained, especially on texts from areas other than source code, it did not encounter such long tokens. On the other hand, each word of such a compound token can individually exist in the vocabulary. In the case of RoBERTa, tokens are often encoded with a leading space, so a token without a leading space has a greater chance of being interpreted incorrectly, since it is not in the vocabulary. Thus, pretokenization helps cope with this problem by performing an initial split into suitable subtokens; the pretrained BERT tokenizer then breaks these down further into subtokens that are present in the BERT vocabulary.
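A possible implementation of such a pretokenization step is sketched below; the splitting regex is an illustrative assumption, and the exact subword splits depend on the vocabulary of the particular BERT model.

```python
import re
from transformers import AutoTokenizer

def split_identifier(token: str) -> list[str]:
    """Illustrative splitting of snake_case and camelCase identifiers into words."""
    words = []
    for part in token.split("_"):                      # snake_case boundaries
        # insert a boundary before every uppercase letter that follows a lowercase letter or digit
        words.extend(re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", part).split())
    return [w for w in words if w]

tok = AutoTokenizer.from_pretrained("roberta-base")
print(split_identifier("mAssistants"))                 # ['m', 'Assistants']
print(tok.tokenize("mAssistants"))                     # subword split without pretokenization
print([p for w in split_identifier("mAssistants")      # subword split after pretokenization
       for p in tok.tokenize(w)])
```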
Class Imbalance
One of the problems of this task is the strong class imbalance. Table 4 shows that more than 77% of the tokens belong to class "O", while "Error_Name" accounts for a little more than 0.1%. There are several ways to combat this in the data: artificially increasing the number of examples of small classes (upsampling), reducing the number of examples of the numerous classes (downsampling), and using weighted loss functions to make model training more sensitive to examples of the desired type.
Upsampling was used in the process of collecting comment examples for CodeReviewCommentsNER: after labeling part of the set, it became clear that some classes were underrepresented, so additional suitable examples were collected from the general pool of reviews using regular expressions.
Downsampling is not very suitable, since the dominant class is "O", and its prevalence is almost impossible to avoid due to the characteristics of the texts and the selected classes. The remaining classes already had small absolute counts, so there was no point in downsampling them.
FocalLoss [18] was chosen as a loss function that supports assigning weights to classes. This loss function was developed to counter class imbalance in computer vision tasks:

FL(p_t) = −α_t (1 − p_t)^γ log(p_t),

where p_t is the probability predicted by the model for the true class and α_t ∈ [0, 1] is a class weighting factor; in practice, α can be taken as the inverted class frequency, and γ ∈ [0, 5].
The introduction of such a weighting parameter is a generally accepted method of combating class imbalance. This parameter equalizes the importance of numerous and few examples, but does not distinguish between simple and complex examples. To do this, a parameter γ is introduced that helps reduce the weight of simple examples.
The constants γ and α were selected empirically, with γ = 2.
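A sketch of a multiclass focal loss with per-class weights, following the formula above, could look as follows; this is an illustrative implementation, not necessarily the exact one used in our experiments.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha, gamma=2.0):
    """Multiclass focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    logits:  (N, C) raw scores; targets: (N,) class indices;
    alpha:   (C,) per-class weights (e.g., inverted class frequencies).
    """
    log_p = F.log_softmax(logits, dim=-1)                      # log-probability of every class
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t of the true class
    pt = log_pt.exp()
    alpha_t = alpha[targets]
    return (-alpha_t * (1.0 - pt) ** gamma * log_pt).mean()

# usage: loss = focal_loss(logits.view(-1, NUM_LABELS), labels.view(-1), alpha, gamma=2.0)
```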
ADDITIONAL BINARY MODELS
In the SoftNER work, two additional models were proposed that were trained for binary classification of “O” and “Entity”. The penultimate linear layer of the classifier was then used as part of the token vector representation in the main model. One of the models is based on the BERT architecture (Segmenter), and the second is a context-independent model that evaluates each word separately based on frequency characteristics (Recognizer). According to the authors, the use of such different models should help better classify words that occur in both regular communication contexts and source code snippets.
Contextually Independent Model
As mentioned earlier, one of the models is context-independent, meaning it processes each word regardless of the context it is in. It consists of three parts:
• a frequency dictionary built on the GigaWord6 dataset containing regular text;
• a frequency dictionary built on StackOverflow question-answer pairs;
• a Fasttext7 model trained in an unsupervised manner on a significant set of code reviews.
Next, the frequency characteristics of the dictionaries are converted into vectors using the Gaussian discretization method [19] and concatenated with the Fasttext output vector. Then, the resulting vector goes through a few fully connected linear layers of the artificial neural network. The detailed scheme is shown in Fig. 1.
Fig. 1. Context-independent model architecture. [Image not available. See PDF.]
Since this model also depends on the dictionary and tokenization, the tokenizer from Subsection 3.2 was used for text splitting.
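A simplified sketch of such a context-independent model is given below. For brevity, the Gaussian discretization of the frequency features [19] is replaced here with plain log-frequencies, so this is an approximation of the described architecture, with illustrative layer sizes and file names.

```python
import math
import torch
from torch import nn
import fasttext

# fastText model trained in an unsupervised manner on review texts (file name is a placeholder);
# ft_dim below must match the dimension of the loaded model.
ft = fasttext.load_model("reviews_fasttext.bin")

class Recognizer(nn.Module):
    """Context-independent binary classifier: word vector + frequency features -> linear layers."""
    def __init__(self, ft_dim=100, freq_dim=2, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(ft_dim + freq_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),                 # "O" vs "Entity"
        )

    def forward(self, token, gigaword_freq, stackoverflow_freq):
        vec = torch.tensor(ft.get_word_vector(token))
        # Simplified frequency features; the paper uses Gaussian binning instead.
        freqs = torch.tensor([math.log1p(gigaword_freq), math.log1p(stackoverflow_freq)])
        return self.mlp(torch.cat([vec, freqs]))
```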
Contextually Dependent Model
The context-dependent model is based on the BERT architecture. As in the basic multiclass classification solution, experiments were conducted with the same 4 base models described in Section 4.1. The binary classifier structure repeats the structure of the multiclass classifier.
Model Interaction Methods
As mentioned earlier, in the SoftNER work, additional models were used to provide an additional part of the token vector representation, as shown in Fig. 2. When concatenating two vectors of 768 numbers, there would be a sharp change in vector size from 1536 to 31, so an intermediate layer of size 500 can be added.
Fig. 2. Scheme for combining vector representations using the concatenation method. [Image not available. See PDF.]
Another variant of using the additional model is proposed in [20]. Two methods are considered: Classify-Verify and Classify-Trust. In the first method, if the general model predicts any answer, this prediction is compared with the predictions of the specialized model, and a decision is made about the resulting class. If the general model predicts “Absence of an answer” (which in our case can be interpreted as the “O” class), then “Absence of an answer” becomes the final answer. In the Classify-Trust method, if the general model predicts any answer, the answer from the specialized model is automatically used, and similarly to the first method – in the absence of an answer from the general model, such an answer becomes the final one.
In our configuration, the “general” model can be called the binary classification model, and the “specialized” model is the multiclass model. The Classify-Trust method (the implementation scheme is shown in Fig. 3) is applied directly according to the proposed method: if the binary model predicts “O”, the final answer is “O”, and otherwise, we look at the predictions of the multiclass model. The Classify-Verify method is more difficult to interpret, as it is unclear how to compare the predictions of the binary and multiclass models.
Fig. 3. Applying auxiliary models using the Classify-Trust method. [Image not available. See PDF.]
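The Classify-Trust combination itself reduces to a simple per-token rule; the sketch below assumes both models expose a predict method returning one label per token, which is an illustrative interface rather than a fixed API.

```python
def classify_trust(tokens, binary_model, multiclass_model):
    """Classify-Trust: the binary model decides whether a token is an entity;
    the multiclass prediction is used only for tokens marked as entities."""
    binary_preds = binary_model.predict(tokens)      # "O" or "Entity" per token
    multi_preds = multiclass_model.predict(tokens)   # one of the 31 IOB2 labels per token
    return ["O" if b == "O" else m
            for b, m in zip(binary_preds, multi_preds)]
```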
METRICS
The basic metrics in the NER problem are, as in classification problems in general, precision, recall, and F1-score. They are calculated for each class separately and averaged according to one of the strategies: macro, micro, weighted. For us, the most suitable strategy is macro, since according to it the overall average value is considered as the average for all classes, regardless of the number of elements in each class separately. However, these metrics are calculated on a per-token basis, and as was shown in Section 3.2, tokenizers sometimes split the original words quite heavily, thereby generating a large number of not entirely meaningful tokens. Moreover, from the point of view of the end user, it is more important to see that the classifier has completely identified the entire named entity, and not just some subtokens from it.
BIO Notation
BIO (short for beginning, inside, and outside) notation is one of the generally accepted formats for marking tokens in text chunking tasks. The prefix B- before a tag indicates that the token begins a fragment, and I- indicates that the token is inside a fragment. The O tag indicates that the token does not belong to any of the classes. There are several varieties of such notations: BIO/IOB, IOB2, BIOES, BILOU, which differ in additional prefixes or in the markup rules. For example, according to the rules of the original BIO/IOB notation, the prefix B- is assigned to a token only when it starts an entity that immediately follows another entity of the same class. When labeling the data, IOB2 rules were used, in which the prefix B- is always placed on the first token of a new entity.
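As an illustration (with a hypothetical tokenization following the rules of Section 3.2), a comment mentioning a file name would be labeled in IOB2 as follows:

```python
# Illustrative IOB2 labeling; the token split is hypothetical.
tokens = ["please", "rename", "wine_data", ".", "csv"]
labels = ["O", "O", "B-File_Name", "I-File_Name", "I-File_Name"]
```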
Entity Metrics
At the 2013 International Workshop on Semantic Evaluation (SemEval) [21], 4 ways to calculate precision/recall/F1-score were introduced:
• exact match of boundaries and type (Strict);
• exact match of boundaries, regardless of type (Exact);
• partial overlap of boundaries, regardless of type (Partial);
• some overlap between the predicted and the correct answer, with matching type (Type).
We were interested in Strict, as the most strict and accurate method, and Type, which checks that there is at least some match with the correct answer. Type counts a prediction as correct if there is any positional overlap between the predicted and the correct entity, but only if the entity type matches. Strict counts a prediction as correct only when it completely matches the correct answer in both boundaries and type.
EXPERIMENTS AND RESULTS
To assess the quality of the multiclass classification models, we used the Type F1 and Strict F1 metrics described in Section 6.2, as well as the usual Token F1 (the F1-measure for tokens; since IOB2 notation is used, there are not 16 but 31 classes). Macro averaging was used for all metrics. Binary classification models were evaluated using token-level Precision, Recall, and F1-score.
The experiments were carried out on our labeled data set with a fixed division into training and test samples.
Measurements were carried out on a workstation with the Ubuntu Server 20.04 LTS operating system, an Intel® Core™ i7-6700 CPU @ 3.40GHz, 32GB RAM, and an NVIDIA TITAN Xp graphics accelerator with 12GB of memory. For the Python configuration, a virtual environment with Python 3.9.16, torch==2.0.1, transformers==4.27.4 was used.
The batch size was chosen to be 16, the maximum size at which the training process fits into the memory of the graphics accelerator. Training for 20 epochs takes about 15 minutes for configurations with one model and about 20 minutes when using additional models. Running on the test set of 457 comments takes 4–9 seconds depending on the configuration.
Basic Models
Below are the results of comparative training runs of 4 basic models (from paragraph 4.1) and the impact of our proposed improvements described in paragraphs 4.2, 4.3.
As can be seen from Table 5, the improvements proposed in Sections 4.2 and 4.3 (the use of the new loss function is marked as "+ focal loss", and the use of both improvements is marked as "+ enhance") improve the quality of classification. The proposed improvements give the greatest increase for the most basic model, roberta-base, raising Token F1 by 16.3% and the entity-level metrics by an average of 7.7%. The example of codebert-base shows that the proposed improvements can reduce the results for some metrics (Strict F1). The experimental results also show that pretokenization does not always have a positive effect on classification quality; this behavior can be explained by the need to write model-specific rules for the auxiliary token splitting. It can also be noted that using a model pretrained on texts from the corresponding subject area improves the results: PretrainedCodeBert has better results than codebert-base, and BertOverflow outperforms roberta-base.
Table 5. Results of multiclass classification models testing
Model | Token F1 | Type F1 | Strict F1 |
|---|---|---|---|
roberta-base | 0.5913 | 0.6741 | 0.6145 |
roberta-base + focal loss | 0.6692 (+13.2%) | 0.7111 (+5.5%) | 0.6543 (+6.4%) |
roberta-base + enhance | 0.6879 (+16.3%) | 0.7253 (+7.6%) | 0.6628 (+7.8%) |
BertOverflow | 0.6483 | 0.6963 | 0.6428 |
BertOverflow + focal loss | 0.6983 (+7.7%) | 0.7108 (+2.1%) | 0.6606 (+2.7%) |
BertOverflow + enhance | 0.7091 (+9.3%) | 0.7116 (+2.2%) | 0.6617 (+2.9%) |
codebert-base | 0.6212 | 0.7414 | 0.7192 |
codebert-base + focal loss | 0.6836 (+10%) | 0.7467 (+0.7%) | 0.7193 (+0%) |
codebert-base + enhance | 0.6943 (+11.7%) | 0.7464 (+0.6%) | 0.6962 (–3.2%) |
PretrainedCodeBert | 0.6414 | 0.7528 | 0.6954 |
PretrainedCodeBert + focal loss | 0.7219 (+12.5%) | 0.7699 (+2.3%) | 0.7246 (+4.2%) |
PretrainedCodeBert + enhance | 0.7342 (+14.4%) | 0.7695 (+2.2%) | 0.7235 (+4.0%) |
Additional Binary Models
This section describes the results of experiments with training the additional binary classification models described in Section 5. To train the binary models, only two classes ["O", "Entity"] without BIO notation were used. Quality was assessed using token-level Precision, Recall, and F1-score. The same data was used as for multiclass classification, with the labels of all entities replaced by "Entity".
Table 6 contains the results of testing binary models. All classifiers based on the BERT architecture have approximately equal results; the classifier with PretrainedCodeBert performed better by a small margin. The context-independent Recognizer model also showed a good result of 0.8081 F1-score, but it lags far behind the context-aware classifiers.
Table 6. Results of binary classification models testing
Model | F1-score | Precision | Recall |
|---|---|---|---|
Recognizer | 0.8081 | 0.7883 | 0.8292 |
roberta-base | 0.9496 | 0.9462 | 0.9531 |
BertOverflow | 0.9404 | 0.9328 | 0.9486 |
CodeBert | 0.9414 | 0.9325 | 0.9514 |
PretrainedCodeBert | 0.9541 | 0.9498 | 0.9585 |
Models Ensemble
Next, the results of experiments combining additional models with the main one, described in Section 5.3, are shown.
Table 7 shows the results of testing various model combination configurations. The main model used was roberta-base with the improvements described in Sections 4.2 and 4.3 (denoted in Table 7 as Base), as it showed the highest sensitivity to the improvements in the experiments from Section 7.1. Additional models: the context-dependent Segmenter based on roberta-base (denoted as Seg) and the context-independent Recognizer (denoted as Reco).
Table 7. Results of model ensemble configurations testing
Model | Token F1 | Type F1 | Strict F1 |
|---|---|---|---|
Base | 0.6879 | 0.7253 | 0.6628 |
Base + Reco -Emb | 0.6872 | 0.7053 | 0.6426 |
Base + Seg -Emb | 0.6948 | 0.7251 | 0.6783 |
Base + Seg -Emb 500_31 | 0.6773 | 0.7199 | 0.6782 |
Base + Seg + Reco -Emb | 0.6998 | 0.7204 | 0.6825 |
Base + Seg + Reco -Emb 500_31 | 0.6791 | 0.7142 | 0.6662 |
Base + Seg -T | 0.7127 | 0.7342 | 0.6873 |
Base + Reco -T | 0.6073 | 0.6439 | 0.6032 |
Base + Seg||Reco -T | 0.6227 | 0.6424 | 0.6054 |
Base + Seg&&Reco -T | 0.7211 | 0.7395 | 0.6954 |
Runs marked "-Emb" denote the concatenation of vector representations generated by additional models with the vector representation from Base. The notation "500_31" means the presence of an additional intermediate linear layer. As a result of concatenating vector representations, their size per token is 1536 for "Base + Seg -Emb" and 1556 for "Base + Seg + Reco -Emb".
As can be seen from Table 7, using two additional models gives a greater increase in quality than using them separately. Although by the Type F1 metric, the run with only Segmenter has a higher result. It can also be noted that adding an additional intermediate linear layer of size 500 only worsens the classifier’s results.
Runs marked “-T” denote model combination using the Classify-Trust method, where auxiliary models act as oracles answering the question: “is this token not an entity (does this token belong to the “O” class)?”. In the case of a positive oracle response, the token is marked as “O”, and otherwise, the prediction of the multi-class classifier is considered. In the case of using two binary oracle models, their results can be either combined (if at least one oracle gives a positive response, denoted as || in Table 7) or intersected (if both oracles give positive responses, denoted as && in Table 7).
Based on the data in Table 7, it can be concluded that, when applying the Classify-Trust method, it is also better to use both auxiliary models than each of them separately. It is also worth noting that the Reco and Seg||Reco runs perform worse than the others, since the oracles produce many false alarms and a significant number of tokens are incorrectly marked as "O", in contrast to Seg&&Reco, where the "O" label is issued more accurately. In conclusion, in our experiments the Classify-Trust method performs better than concatenating vector representations, and the best result was shown by the run "Base + Seg&&Reco -T", which uses both auxiliary models.
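The || and && combinations of the two binary oracles amount to the following per-token rule (an illustrative sketch operating on per-token predictions):

```python
def combine_oracles(seg_preds, reco_preds, mode="and"):
    """Combine two binary oracles per token: "or" marks a token as "O" if at least
    one oracle says "O" (Seg||Reco), "and" only if both say "O" (Seg&&Reco)."""
    if mode == "or":
        return ["O" if (s == "O" or r == "O") else "Entity"
                for s, r in zip(seg_preds, reco_preds)]
    return ["O" if (s == "O" and r == "O") else "Entity"
            for s, r in zip(seg_preds, reco_preds)]
```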
The implemented methods were tested on the dataset from the SoftNER article, specifically the data available in the public domain8. Among the published data, there is a training and test set with markup on 28 classes in the IOB2 notation. There is also a training set with a reduced set of class labels, but there is no corresponding test set. Table 8 contains the test results. It can be seen that by all three metrics, the best result is shown by the classifier with all proposed improvements.
Table 8. Results of testing the implemented classifiers on the SoftNER dataset
Model | Token F1 | Type F1 | Strict F1 |
|---|---|---|---|
roberta-base | 0.326 | 0.4806 | 0.3923 |
roberta-base + enhance | 0.4971 | 0.5472 | 0.4666 |
roberta-base + enhance -Emb | 0.4836 | 0.5607 | 0.494 |
roberta-base + enhance -T | 0.5066 | 0.5716 | 0.4984 |
The best configuration for combining models, "Base + Seg&&Reco -T", was applied to all base models. The testing results are presented in Table 9. In each section of Table 9, the following are presented:
• results of the base model with improvements;
• classifier with additional models using the Classify-Trust method (denoted as "-T");
• case of an ideal oracle using the Classify-Trust method (denoted as "-T*").
Table 9. Results of model ensemble testing
Model | Token F1 | Type F1 | Strict F1 |
|---|---|---|---|
roberta-base | 0.6879 | 0.7253 | 0.6628 |
roberta-base -T | 0.7211 | 0.7395 | 0.6954 |
roberta-base -T* | 0.7464 | 0.7805 | 0.7327 |
BertOverflow | 0.7091 | 0.7116 | 0.6617 |
BertOverflow -T | 0.6965 | 0.7467 | 0.6999 |
BertOverflow -T* | 0.7732 | 0.7930 | 0.7586 |
codebert-base | 0.6943 | 0.7464 | 0.6962 |
codebert-base -T | 0.7172 | 0.7685 | 0.7175 |
codebert-base -T* | 0.7517 | 0.8153 | 0.7629 |
PretrainedCodeBert | 0.7342 | 0.7695 | 0.7235 |
PretrainedCodeBert -T | 0.7347 | 0.7703 | 0.7290 |
PretrainedCodeBert -T* | 0.7866 | 0.8263 | 0.7917 |
For all classifiers, the runs marked "-T" show better results than the base models. Due to the architecture of the classifier with the Classify-Trust method, it is easy to conduct an experiment with an ideal binary classification oracle (the runs marked "-T*" in Table 9), in which the trained binary classifiers are replaced with the true token labels. This experiment shows the maximum possible gain obtainable from the Classify-Trust method: the ideal oracle adds 5 to 8 points of F-measure for all base models. In conclusion, the best result was achieved by the classifier with the pretrained base model PretrainedCodeBert, with the improvements proposed in Sections 4.2 and 4.3 and with the auxiliary models applied using the Classify-Trust method, achieving Token F1 = 0.7347, Type F1 = 0.7703, Strict F1 = 0.7290.
Errors in Retrieving Named Entities
Figure 4 shows the confusion matrix of the best obtained model (PretrainedCodeBert). For convenience, the B-Tag and I-Tag classes were combined in the testing results, resulting in a 16 × 16 matrix instead of a 31 × 31 matrix. The matrix contains values normalized over the true labels (each row sums to one, up to rounding). Manual analysis of the examples where the classifier makes mistakes reveals several groups of errors.
Fig. 4. Confusion matrix. [Image not available. See PDF.]
7.4.1. Errors in subtokens. It should be noted that classification occurs at the level of subtokens produced by the RoBERTa tokenizer, and the improvements developed in this direction (Subsection 4.2) do not always help. Sometimes long entities are split into many small subtokens, each of which the model needs to classify. There are several approaches to combat this problem: (a) change the training method so that only the first subtoken is considered in the loss function calculation and the final classification; (b) combine the model's predictions for all subtokens by some voting scheme to obtain a single decision for the entire entity.
7.4.2. Errors with the "O" class. These are the most common: out of 924 incorrectly classified tokens, 523 errors are related to the "O" class. From Fig. 4, it can be seen that 28% of "External_Tool" tokens were classified as "O". For example, in the comment "I ran make lint and make pylint before pushing this commit", the token "make" was not recognized as an entity. Or, in the example "*return just empty ColumnNothing otherwise", the token "empty" was classified as Keyword, although it is not one.
7.4.3. Errors in classification between classes. The most understandable confusion pair is Variable and Function (81 misclassified tokens). For example, in the comment "So I think I've got the current fastest implementation using bytesBefore and the underlying SWAR indexOf, thanks to you! ;-)", the tokens "bytes" and "Before" were classified as Function, but the correct class is Variable. Or, in the example "I pushed a change to use np.newaxis whenever possible", the tokens "new" and "axis" were identified as Function when the correct answer is Variable. Another frequently confused pair is Variable and Value (53 mutual errors). Since string constants, which are often enclosed in quotes, were labeled as Value, the following situations arise: in the comment "'propertyIdWithAreaIds' I think it is better to be more clear with the name", the token "propertyIdWithAreaIds" was classified as Value, but it is a variable name.
CONCLUSIONS
In this work, methods for improving the quality of named entity recognition were studied in the context of the software development domain. The implemented approaches were compared on a new, manually collected and labeled dataset of 3000 reviews left by users on source code changes. The number of recognized classes is 15. The proposed and implemented methods for improving classification quality raised the results by 8–13 points of F1-score for the different metrics. Manual inspection showed classification quality sufficient for applying the tool in practical tasks. The developed tool for finding and classifying entities will help better solve classification, clustering, and semantic search tasks for comments in this subject area.
FUNDING
This work was supported by ongoing institutional funding. No additional grants to carry out or direct this particular research were obtained.
CONFLICT OF INTEREST
The authors of this work declare that they have no conflicts of interest.
1https://stackoverflow.com/questions
2https://doi.org/10.5281/zenodo.10060889
3https://github.com/
4https://android-review.googlesource.com/
5https://review.tizen.org/gerrit/
6https://catalog.ldc.upenn.edu/LDC2011T07
7https://fasttext.cc/
8https://github.com/jeniyat/StackOverflowNER/tree/master/resources/annotated_ner_data/
Publisher’s Note.
Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
AI tools may have been used in the translation or editing of this article.
REFERENCES
1 Devlin, J., Chang, M.W., Lee, K., and Toutanova, K., BERT: pre-training of deep bidirectional transformers for language understanding, in Proc. Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1: Long and Short Papers, Minneapolis: Association for Computational Linguistics, 2019, pp. 4171–4186. https://doi.org/10.18653/v1/N19-1423
2 Tabassum, J., Maddela, M., Xu, W., and Ritter, A., Code and named entity recognition in StackOverflow, in Proc. 58th Annu. Meeting of the Association for Computational Linguistics, Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J., Eds., Association for Computational Linguistics, 2020, pp. 4913–4926. https://doi.org/10.18653/v1/2020.acl-main.443
3 Sharnagat, R., Named entity recognition: A literature survey, June 30, 2014. https://www.cfilt.iitb.ac.in/resources/surveys/rahul-ner-survey.pdf.
4 Bikel, D.M., Schwartz, R., and Weischedel, R.M., An algorithm that learns what's in a name, Mach. Learn., 1999, vol. 34, pp. 211–231. https://doi.org/10.1023/A:1007558221122
5 Lafferty, J.D., McCallum, A., and Pereira, F.C.N., Conditional random fields: probabilistic models for segmenting and labeling sequence data, in Proc. 18th Int. Conf. on Machine Learning (ICML’01), San Francisco, CA: Morgan Kaufmann, 2001, pp. 282–289.
6 McCallum, A. and Li, W., Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons, in Proc. 7th Conf. on Natural Language Learning at HLT-NAACL 2003, vol. 4: CONLL’03, Association for Computational Linguistics, 2003, pp. 188–191. https://doi.org/10.3115/1119176.1119206
7 Cortes, C. and Vapnik, V., Support-vector networks, Mach. Learn., 1995, vol. 20, pp. 273–297. https://doi.org/10.1007/BF00994018
8 McNamee, P., and Mayfield, J., Entity extraction without language-specific resources, Proc. Conf. on Computational Natural Language Learning, 2002.
9 Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., and Dyer, C., Neural architectures for named entity recognition, in Proc. Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA: Association for Computational Linguistics, 2016, pp. 260–270. https://doi.org/10.18653/v1/N16-1030
10 Batbaatar, E. and Ryu, K.H., Ontology-based healthcare named entity recognition from Twitter messages using a recurrent neural network approach, Int. J. Environ. Res. Public Health, 2019, vol. 16, 3628. https://doi.org/10.3390/ijerph16193628
11 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., and Polosukhin, I., Attention is all you need, in Proc. 31st Int. Conf. on Neural Information Processing Systems (NIPS’17), Red Hook, NY: Curran Associates, 2017, pp. 6000–6010.
12 Lothritz, C., Allix, K., Veiber, L., Bissyandé, T.F., and Klein, J., Evaluating pretrained transformer-based models on the task of fine-grained named entity recognition, in Proc. 28th Int. Conf. on Computational Linguistics, Barcelona: International Committee on Computational Linguistics, 2020, pp. 3750–3760. https://doi.org/10.18653/v1/2020.coling-main.334
13 Tikhomirov, M., Loukachevitch, N., Sirotina, A., and Dobrov, B., in Natural Language Processing and Information Systems: NLDB 2020, Cham: Springer, 2020. https://doi.org/10.1007/978-3-030-51310-8_2
14 Malik, G., Cevik, M., Bera, S., Yildirim, S., Parikh, D., and Basar, A., Software requirement specific entity extraction using transformer models, Proc. Canadian Conf. on Artificial Intelligence, Toronto, 2022. https://doi.org/10.21428/594757db.9e433d7c
15 Ye, D., Xing, Z., Foo, C.Y., Ang, Z.Q., Li, J., and Kapre, N., Software-specific named entity recognition in software engineering social content, Proc. 23rd IEEE Int. Conf. on Software Analysis, Evolution, and Reengineering (SANER), Osaka, 2016, pp. 90–101. https://doi.org/10.1109/SANER.2016.10
16 Liu, Zh., Lin, W., Shi, Ya., and Zhao, J., A robustly optimized BERT pretraining approach with post-training, in Proc. 20th Chinese National Conf. on Computational Linguistics, Huhhot: Chinese Information Processing Society of China, 2021, pp. 1218–1227.
17 Feng, Zh., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., and Zhou, M., CodeBERT: A pre-trained model for programming and natural languages, in Proc. Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, 2020, pp. 1536–1547. https://doi.org/10.18653/v1/2020.findings-emnlp.139
18 Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P., Focal loss for dense object detection, IEEE Trans. Pattern Anal. Mach. Intell., 2020, vol. 42, pp. 318–327. https://doi.org/10.1109/TPAMI.2018.2858826
19 Anderson, P.E., Reo, N.V., and DelRaso, N.J., Gaussian binning: A new kernel-based method for processing NMR spectroscopic data for metabolomics, Metabolomics, 2008, vol. 4, pp. 261–272. https://doi.org/10.1007/s11306-008-0117-3
20 Xu, C., Barth, S., and Solis, Z., Applying Ensembling Methods to BERT to Boost Model Performance, Stanford Univ., 01.11.2023.
21 UzZaman, N., Llorens, H., Derczynski, L., Allen, J., Verhagen, M., and Pustejovsky, J., SemEval-2013 task 1: TempEval-3: Evaluating time expressions, events, and temporal relations, in Proc. 2nd Joint Conf. on Lexical and Computational Semantics (*SEM), vol. 2: Proc. 7th Int. Workshop on Semantic Evaluation (SemEval 2013), Atlanta, GA: Association for Computational Linguistics, 2013, pp. 1–9.