DeepOCL: A deep neural network for Object Constraint Language generation from unrestricted nature language

Abstract

Object Constraint Language (OCL) is one kind of lightweight formal specification, which is widely used for software verification and validation in NASA and Object Management Group projects. Although OCL provides a simple expressive syntax, it is hard for the developers to write correctly due to lacking knowledge of the mathematical foundations of the first‐order logic, which is approximately half accurate at the first stage of development. A deep neural network named DeepOCL is proposed, which takes the unrestricted natural language as inputs and automatically outputs the best‐scored OCL candidates without requiring a domain conceptual model that is compulsively required in existing rule‐based generation approaches. To demonstrate the validity of our proposed approach, ablation experiments were conducted on a new sentence‐aligned dataset named OCLPairs. The experiments show that the proposed DeepOCL can achieve state of the art for OCL statement generation, scored 74.30 on BLEU, and greatly outperformed experienced developers by 35.19%. The proposed approach is the first deep learning approach to generate the OCL expression from the natural language. It can be further developed as a CASE tool for the software industry.

Full text

Translate

Turn on search term navigation

INTRODUCTION

Formal specification is widely used and essential in software engineering, especially in safety-critical areas [1]. As a lightweight formal language, Object Constraint Language (OCL) helps users to achieve extensive and accurate specifications and dominates current applications [2]. OCL, a formal specification language based on the first-order logic, which describes the constraints of the Unified Modelling Language (UML), is semantically explicit. In practice, OCL is used to express the semantics of model invariants, queries and constraints for operations with applications in downstream tasks, such as requirements modelling for complex projects and automatic test case generation [3, 4].

Despite its significance, OCL has encountered dilemmas in its application. The main factor is that OCL is inherently difficult to use [5]. It is prone to imperceptible errors when writing constraints manually [6]. Due to the unfamiliarity with syntax, OCL is generally avoided in development [7]. Besides, the time spent on OCL development for model constraints accounts for considerable costs in complex projects [4]. Alternatively, Natural Language (NL) is often used as an informal constraint to complement the model semantics. Despite its ease of application, its semantics are too vague to be used as a formal constraint. Therefore, the research for generating OCL from the natural language automatically is valuable.

However, the research on this area faces the following challenges: (1) The ambiguity of natural language: natural language is complex in grammar and can be interpreted implicitly and flexibly. (2) The complexity of OCL syntax: OCL can express abundant semantics, and synonymy can be expressed in different textual forms. (3) The mapping between OCL syntax and Object-Oriented (OO) semantics is not apparent and easy to understand for the developers. As the current work concentrates on business in specific application scenarios, they can only perform transformations on the controlled natural language subset to OCL [6, 8].

In this paper, we proposed DeepOCL, a deep learning-based approach that achieves the sentence-level conversion of NL to OCL statement without domain models, which assists the developers in getting the required constraints with minor adjustments. DeepOCL consists of two components, the generator and the selector. Specifically, the generator takes the NL input and generates the OCL candidates, after which the selector chooses the most appropriate output. After fine-tuning the pre-trained model with the weighted loss function, the generator can be more sensitive to the tokens of the keywords. DeepOCL learns the assessment of the corresponding OCL through the selector, which enables DeepOCL to give more human-tailored snippets rather than only from the perspective of the statistical model. Compared to traditional methods, DeepOCL generates OCL at a large scale with unrestricted NL, which performs much better in inference and association. To address the lack of corpus, we manually collected OCLPairs, a sentence-level aligned dataset of OCL statements collected from the official OMG 1

documentation species and the case study from the RM2PT [9] projects.

In our experiments, we conduct the ablation study to demonstrate that each improvement to the model process is effective, including the selector and the weighted loss. We put the DeepOCL with the proposed OCLPairs dataset to the test. In the experiment, we apply three metrics to evaluate the result, including Bilingual Evaluation Understudy (BLEU) [10], Recall-Oriented Understudy for Gisting Evaluation (ROUGE-L) [11] and accuracy (Acc). The DeepOCL achieved 56.69 on BLEU, 34.49% on Acc and 59.84 on ROUGE-L on the test set. Qualitative examples are given to illustrate the result. Compared to experienced human developers, DeepOCL outperforms the subjects by 35.19% (BLEU), 99.94% (Acc) and 19.52% (ROUGE-L), respectively. Considering DeepOCL can implement inference instantly for an OCL statement while the same task often takes minutes for experienced developers, it suggests that DeepOCL significantly outperforms humans in efficiency.

In summary, we made the following contributions:

DeepOCL, a deep learning-based method for generating OCL statements from the unconstrained NL. To the best of our knowledge, this is the first work to use deep learning techniques.
OCLPairs, an NL-OCL sentence-level aligned dataset, accelerates further research in this area. To the best of our knowledge, it is the first corpus for this task.
We demonstrate that the proposed DeepOCL reaches the best performance on the proposed dataset. Moreover, ablation experiments demonstrate that each improvement to the model process is effective, which outperforms human developers.

The remainder of this paper is organised as follows. Section 2 presents background knowledge on the DeepOCL. Section 3 surveys the related work about OCL generation. Section 4 elaborates on the details of DeepOCL and Section 5 explains the research questions and the analysis of the experiment. Finally, Section 6 discusses threats to validity and Section 7 concludes the paper and our future directions.

BACKGROUND

In this section, we introduce some background knowledge of our work, including OCL and pre-trained language models.

Object Constraint Language

OCL 2

was developed by IBM in 1995 as a way to overcome the inability of UML to express detailed constraints in a system design. Recently, OCL also provides precise requirements constraints, system operations, and queries on meta-models in model-driven architectures. In the standard model transformation language, OCL is applied for transformation rules and completeness rules for expressing models.

OCL is a declarative language without side effects, significantly different from other programing languages. The contracts and expressions of OCL cannot change any element or state of the model. As a declarative language, contracts written in OCL do not specify a specific flow of operations but directly constrain the result. In addition, OCL is a strongly typed language: every operation and instance in the execution is checked for the correct type.

OCL provides a variety of expressions to satisfy different applications, which are expected to return Boolean values, including:

Invariant: Invariant is used to constrain instances to ensure that each instance in the model satisfies the constraint.
Query: Query checks if the current instance satisfies a constraint, which is constantly used to query the system data and returns the information to the user.
Operation contract: Operation contract constrains the legality of an operation, checking if the state of the system is satisfied before and after an operation on the model.

To better illustrate OCL as contracts and our work, we present a lightweight formal model as well as an operation contract in Figure 1, including a use case diagram, system sequence diagrams, contracts of system operations and a conceptual class diagram. The requirement model describes a trading system in a process at a single cash desk of a supermarket. The use case diagram shows the major roles and functions of the system, as Figure 1a. The system sequence diagram Figure 1b introduces the details of events when the cashier processes sales. The conceptual class diagram Figure 1d describes the classes and their relationship in the system. The given contract specifies the conditions that the state of the system is assumed to satisfy before the system operation, that is, pre-condition, and after the system operation, that is, post-condition. Figure 1c illustrates the OCL contract of operation enterItem, which consists of four parts: the signature, the definition section and the pre-condition section and post-condition section:

[IMAGE OMITTED. SEE PDF]

Signature

The signature specifies the system operation, and the name of the use case to which it belongs. Besides, it declares the input parameter and the return type.

Definition

The definition section defines the objects used in the following part. The given part defines the instance variable item of the Item class, all instances of Item kind, attribute Barcode equals barcode.

Pre-condition

The pre-condition specifies the properties of the system state that needs to be satisfied when the system operation is to be executed, including objects, attributes and links between objects. The example checks whether the currentSale and the item are valid and whether their attributes satisfy the conditions.

Post-condition

The post-condition defines the changes that the operation is to realise. As shown in Figure 1, it includes creating and deleting objects and links and modifying the attributes.

Pre-trained language model

Trained on a large corpus with various pre-trained tasks, the pre-trained language model performed well on multiple specific downstream tasks. Due to the lack of large labelled corpus, current pre-trained tasks are mostly unsupervised tasks, for example, LM (Language Modelling) in GPT [12], MLM (Masked Language Modelling) in BERT and RoBERTa, de-noising tasks in T5 [13] and BART [14]. To address the gap between the pre-training and the custom downstream tasks, we applied prompt in our work. Prompt is a new NLP paradigm, allowing the model to adapt to new scenarios with few labelled data after pre-training. Specifically, prompt offers the input templates with additional information to modify the downstream tasks closer to the pre-trained tasks, which promotes the model to perform better on specific tasks. In both T5 [13] and CodeT5 [15], different forms of the prompt are used for multiple downstream tasks to achieve better performance.

The pre-trained language model has achieved great success in natural language processing (NLP). Inspired by this, a considerable amount of pre-trained models were proposed and applied for Software Engineering tasks, for example, services classification [16, 17], code generation [18], code summarisation [19, 20], code completion [21] and clone detection [15], achieving significant progress. In this paper, we adopt CodeT5 [15] as the base model.

RELATED WORK

In this section, we introduce the related work of this paper, the OCL generation in different forms.

OCL generation

Generation from NL

Dedicating to generating useable constraints for specific application scenarios, current approaches are rule-based implementation translations. Bajwa et al. [22, 23] use semantic and syntactic analysis of the natural language to implement OCL constraint generation through Semantics of Business Vocabulary and Rules (SBVR) [24] as an intermediate morphology [6]. The SBVR captures the specifications in the natural language and represents them in the formal logic, which is later transformed into OCL constraints with the syntax model. Xu et al. [8] generate a tuple identifying (Object, Constraint, Reference) from NL through a natural language toolkit (NLTK) [25] to generate OCL constraints on orientation. The aforementioned works all have significant restrictions on the input language: Bajwa et al. identify verbs as class operations and SBVR rules can only accept NL input with limited semantics, which is more like a fill-in-the-blank translation. Similarly, only limited orientation words are supported by Xu et al. The OCLgen [3, 26] addresses the limited verbs better to some extent. OCLgen is designed as a support for UMTG [27], a tool for generating test cases to generate the required OCL constraints from the NL written in restricted use case modelling (RUCM). With semantic role labelling (SRL) and merging synonyms, fewer rules can support enormous verbs for the generation. However, as the OCL constraints are generated from the pre-defined syntax template, the output can only perform restricted semantics and forms. As designed for an exact application scenario, these efforts are proposed for limited OCL constraints: Bajwa et al. cannot handle query-based OCL, Xu et al. limit generating invariants and OCLgen aims at generating pre-conditions and post-conditions. Moreover, committed to generating directly useable contracts, the requirements model like Figure 1 are necessary for these approaches without which the generation cannot be accomplished. Such compulsory can sometimes be a hindrance to development.

Generation from other models

Another research trend is the generation of OCL statements through other models. Dang et al. [28] generate corresponding OCL business rules for snapshots of conceptual models through predefined OCL invariant patterns, providing a reference for designers; Shimba et al. [29] implement a bidirectional conversion between OCL and JML, thus enabling the development of OCL under the model-driven architecture, maintaining the semantics of OCL statements in multiple conversions; Tan et al. [30] identify dom elements in UML through predefined templates to generate potential OCL constraints on class diagrams, providing a simple reference. The above methods are experimental and are less commonly used in practical development. Currently, there are few techniques dedicated to constraint generation with deep learning. To our knowledge, only Kiziltan et al. [31] have attempted to use SVM-HMM [32], an implementation of the structural support vector machine (SVM), for the annotation of constraint-related content for its implementation of a grand unified NL-constraint generation.

Conclusion

The current works are all rule-based, translating the restricted natural language to the OCL with templates, which treat the domain model as a necessary input. Moreover, the capability of inference with implicit NL is lacking. Compared to the aforementioned works, our work (1) does not require a reference to the domain model, extracting information from the input text, which is less restrictive for the development process; (2) can generate required OCL statements without a template from unconstrained NL, which can be adopted for enormous scenarios without extra development; and (3) perform sufficient inference when handling NL with implicit semantics, which gives beneficial results even for vague requirements.

Generation for other programing languages

In addition to OCL, much effort has been put into generating other domain-specific languages, notably Structured Query Language (SQL). Facing the similar difficulties with OCL, writing the required SQL queries can be very challenging when faced with enormous, complex tables. In current work on the natural language to SQL generation, pre-training techniques are widely used and perform well on multi-table cross-domain data. Tao Yu et al. propose GRAPPA [33], a model based on a transformer encoder, which learns a representation from a joint representation of text and table pre-trains the model using a new text-pattern linking target on synthetic data. Peng Shi et al. propose GAP [34], a model based on encoder-decoder for transformer pre-training framework, which achieves better semantic parsing in the representation of the natural language and table structures by using generative models to pre-train data. As OCL and SQL face similar challenges in the generation, including complex nested structured statement generation, model encoding and element alignment, this work can provide a reference for OCL generation.

With the recent boom of deep learning in NLP and pre-trained models, generating code in general purpose language is becoming a hot topic [35]. Early work focussed on NL-code generation as a sequence-to-sequence machine translation. CodeBERT [18] employs the masked language modelling objective same as the BERT to conduct code-specific representation, which was applied to the natural language. Apart from the BERT-based approaches, the Pymt5 [36] and IntelliCode [37] employ GPT and encoder-decoder-based Transformer, respectively, to complete code generation tasks. Recently, more work devoted to applying code-specific knowledge to improve performance. GraphCodeBERT [38] uses the data flow from the programing function into CodeBERT. SPT-Code [39] uses the representation of the abstract syntax tree to enhance the model to perform better on the code-related downstream tasks with scarce corpus. The above pre-training methods enhance the efficiency of code generation from different perspectives.

DEEPOCL

In this section, we first introduce the overview of DeepOCL, including the pipeline and the architecture of the models. We then describe the input for each task, the weighted loss function and the OCLMetric. Finally, we illustrate the strategy applied for the generation and recommendation.

Overview

DeepOCL consists of two components: the generator and the selector. The lower part of Figure 2 shows how inference is performed. The generator converts NL descriptions into a list of corresponding OCL statements, from which the selector picks the most accurate statement as the target output. The selector can evaluate the generated OCL without referring to ground truth, providing the evaluation imitating developers.

[IMAGE OMITTED. SEE PDF]

The generator and the selector are fine-tuned from the original model with custom tasks. Specifically, the training details are illustrated in Figure 2. To address the dilemma of unsupervised scoring, we used a two-stage training strategy inspired by generative adversarial networks (GAN). Two tasks are trained separately in DeepOCL, including (1) the generative task and (2) the evaluation task. The original model is fine-tuned with the generative task I, supervised learning on the OCLPairs. The naive model comes from the original model after the first stage of the generative task, which can generate essentially accurate target statements, including the semantic structure and keywords. The naive model then generates the OCL synonymous substitutions for the labels in the training set. After that, OCLMetric rates the score for each generated sample (NL, OCL_{substitutions}), obtaining the label for the evaluation task. The selector is fine-tuned from the naive model with the evaluation task, while the generator is obtained by fine-tuning with the generative task II with a weighted loss function, which enables the generator better generate the keywords in the OCL statements.

DeepOCL takes the CodeT5 [15] as the original model, a pre-trained Transformer proposed for the code-related downstream tasks. Architecturally, it is a T5-based Transformer. For the most part, the T5 [13] is identical to the traditional Transformers, which use an attention-based encoder-decoder structure. What differs from the Transformer is its word embedding phase. Compared with absolute positional encoding [40] generally used in typical Transformers, T5 uses the relative positional embedding (RPE), focussing on only two elements when implementing the attention mechanism at the first layer of the encoder and the decoder.

As far as the parameter setting of the model is concerned, the generator and the selector share the same setting as the original model: (1) the number of Transformer blocks in the encoder and decoder L = 12, (2) the size of the model d_model = 768, (3) the dimension of feed-forward d_ff = 3072, and (4) the number of self-attention heads h = 12. The total number of parameters is 212M.

Model input

DeepOCL tokenises the input text with the RoBERTa tokeniser. It uses a Byte-Pair-Encoding (BPE) name representation, which performs well with the Out-Of-Vocabulary (OOV) problem. It successfully addresses the encoding of the text in our dataset.

DeepOCL fine-tuned the two components with the generative task and evaluation task specifically. The details of their inputs are introduced below:

Generative task

The generative task uses samples from OCLPairs for supervised training. By taking the input NL for the output OCL, the aim is to make the generator learn the relationship between NL and the desired OCL statements. DeepOCL first pre-processes the input text, including removing the comments, line breaks and additional spaces. This step aims to enable the generator to learn the desired features better than the noise.

After the pre-processing of the text, DeepOCL constructs the prompt for the task, allowing the model to perform better when generating. Specifically, for the NL input, we concatenate it with the prefix and a delimiter token [SEP] and represent the whole input sequence in the format as X = (P₁, …, P_x, N₁, …, N_y, [SEP]), where x and y denote the number of prefix tokens and NL word tokens. The prefix here is the prompt for the generative task.

Evaluation task

With the evaluation task, DeepOCL learns the relationship between the input (NL, OCL) and the corresponding score, which enable the selector to pick the result more closely to the expectation without a reference. To address the dilemma of the insufficient corpus, samples for the task are synonyms of OCLPairs, generated by the naive model. Specifically, the naive model performs OCL generation on the training set in OCLPairs to obtain multiple synonymous sentences {S₁₁, …, S_1n, …S_mn}, where m denotes the number of the input NL sentences and n denotes the number of the generated synonymous statements for each NL. The generated data are then evaluated by OCLMetric; refer to the original OCL to obtain the score for each generated candidate. The generated dataset comprises each sample's (NL, OCL, Score). After the generation, the de-duplication process is conducted, after which the sentences with too high scores are filtered out to prevent over-concentration of samples.

For the generated corpus, the prompt for the evaluation task is constructed. We concatenate the sample with the evaluation prefix and [SEP], as the format X = (P₁, …, P_x, N₁, …, N_y, [SEP], O₁, …, O_z, [SEP]), where x, y and z denote the number of prompt word tokens, NL word tokens and OCL code tokens respectively.

Weighted loss

Sharing the same architecture with T5 [13], the generator and the selector treat different kinds of input and output as simple text strings. To train the model better generate the target tokens, t_n, n ∈ [1, …, N], where t_n denotes the target token, N denotes the length of the sequence, DeepOCL estimates the following conditional probability distribution P, where the c denotes the input context: 1 $P\left({t}_{0},\text{\ldots },{t}_{N}\vert \boldsymbol{c}\right)=\prod\limits _{i=1}^{N}P\left({t}_{i}\vert {t}_{0},\text{\ldots },{t}_{i-1},\boldsymbol{c}\right)$

The generator and the selector try to predict the next token based on the given context and optimise the model with the loss function. For the evaluation task and generative task I, we use the standard cross entropy as the loss function: 2 $Loss=-\sum\limits _{i=1}^{N}\log \,\mathbb{P}\left({t}_{i}\vert {t}_{1},\text{\ldots },{t}_{i-1},\boldsymbol{c}\right)$

Moreover, for generative task II, we have adapted the loss function to be more OCL-compatible, inspired by Cong et al. [41]. In OCL statements, the keywords (denotes as Keys), including standard library functions, reserved words and internal attributes, are more informative; therefore, if the generated result differs from the ground truth in Keys, it often implies a significant deviation in the generation. Accordingly, a greater penalty should be obtained for incorrect predictions of Keys.

We propose a sequence of weights Weight = {1, …, σ, …, 1} with the length of the weight sequence consistent with the size of the dictionary, where tokens belonging to Keys corresponding to the index will have a higher weight. The content of Keys is illustrated in Table 1. Differently, all Keys share the same weight σ rather than differ from type in the OCLMetric. 3 $\left\{\begin{array}{@{}c@{}}\hfill {w}_{i}=\sigma ,\hspace*{.5em}\hspace*{.5em}{t}_{i}\in Keys\hfill \\ \hfill {w}_{i}=1,\hspace*{.5em}\hspace*{.5em}{t}_{i}\notin Keys\hfill \end{array}\right.$ where w_i denotes the weight for the tokens of the corresponding types. The improved loss function is given in equation as follows: 4 $Los{s}_{\mathit{Weighted}}=-\sum\limits _{i=1}^{N}{w}_{i}\log \,\mathbb{P}\left({t}_{i}\vert {t}_{1},\text{\ldots },{t}_{i-1},\boldsymbol{c}\right)$

TABLE 1 The content of Keys and weight.

	Content	Weight
Reserved words	And, attr, body, context, def, if, else, endif, endpackage, implies, in, inv, let, not, oper, or,package, post, pre, post, then, xor, derive	3
Built-in properties	oclIsTypeOf, oclIsKindOf, allInstances, oclInState, oclIsInState, isOperationCall, oclIsUndefined, hasReturned, result, isSignalSent, oclIsNew, oclAsType	1
Standard library
Integer and real	Div, mod, abs, max, min, floor, round, toString
String	Concat, <>, substring, toUpper, toLower, toInteger, toReal, size
Collection	Includes, excludes, count, xcludesAll, isEmpty, notEmpty, sum, exists, forAll, isUnique, sortedBy, iterate includesAll	2
Set	Union, intersection, including, excluding, symmetricDifference, select, reject,collect, flatten, asSequence, asBag
Bag	asSet
Sequence	Append, prepend, insertAt, one, subSequence, first, at, indexOf, any, collectNested, reverse

OCLMetric

To better evaluate the generated OCL statements for the evaluation task, we proposed OCLMetric, a combination of different matches (Figure 3).

[IMAGE OMITTED. SEE PDF]

Overview

OCLMetric accesses the quality of the generated OCL statement based on comparing the reference and the hypothesis. One OCL statement's evaluation is a composite of three assessment areas. The improved BLEU calculates the general similarity of the text, while the Keys Match evaluates the matching ability of the keywords. The AST Match measures the syntax information. OCLMetric score is a weighted combination of the previous three matches: 5 $OCLMetric=\mathrm{exp}\left(\frac{\alpha \ast {\sum }_{i=1}^{3}{w}_{i}{v}_{i}\ast \mathrm{ln}\left({S}_{K}\right)+\beta \ast \mathrm{ln}\left({S}_{B}\right)+\gamma \ast \mathrm{ln}\left({S}_{A}\right)}{\alpha \ast {\sum }_{i=1}^{3}{w}_{i}{v}_{i}+\beta +\gamma }\right)$ where S_K, S_B and S_A denote the score of the three perspectives. The w_i and v_i denote the weight and the validity of the corresponding Keys. OCLMetric is actually a weighted geometric mean of multiple scores, evaluating the match of an OCL statement in terms of text and structure.

The improved BLEU

The naive BLEU evaluates the similarity between the hypothesis and the reference ones from the matched n-grams. However, the main factor that leads to the failure in evaluating shorter codes is the disability for dividing longer grams from short snippets, resulting in an incorrect score.

To remedy this, we have adopted BLEU to assess the shorter snippets properly. The improved BLEU is computed as follows: 6 ${S}_{B}=BP\ast S$ where BP denotes the brevity penalty, S denotes the matching score, which is computed as follows: 7 $S=\mathrm{exp}\left(\frac{1}{K}\sum\limits _{n=1}^{K}\mathrm{ln}\,{p}_{n}\right)$ where the p_n denotes the individual matching score for n-gram. The K is the maximum length for the grams, which generally takes 4. The improved BLEU no longer uses a fixed length of grams to calculate scores but instead determines the longest gram based on the length of the snippet. Specifically, we chose K = 4 as the longest gram it evaluated, which varies from K = 2 to K = 4 depending on the situation. 8 ${p}_{n}=\frac{\sum\limits _{c\in candidates}\sum\limits _{n-gra{m}_{\in c}}Coun{t}_{\mathit{clip}}(n-gram)}{\sum\limits _{{c}_{\in candidates}^{\prime }}\sum\limits _{n-gra{m}_{\in {c}^{\prime }}^{\prime }}Count\left(n-gra{m}^{\prime }\right)}$ where n − gram denotes the n-gram string in the text and Countclip(n − gram) is the maximum number of n-grams co-occurring in the hypothesis and the corresponding reference. Naturally, we also calculate the brevity penalty according to the original BLEU: 9 $BP=\left\{\begin{array}{@{}ccc@{}}\hfill \hfill & \hfill 1\hfill & \hfill if\ {l}_{c}\ge {l}_{s}\hfill \\ \hfill \hfill & \hfill {e}^{1-\tfrac{{l}_{s}}{{l}_{c}}}\hfill & \hfill if\ {l}_{c}< {l}_{s}\hfill \end{array}\right.$ where l_c denotes the length of the candidate code, whereas l_s denotes the length of the reference code.

For the adaptation, 1-g measures the adequacy of the generation and 2-g measures the generation's fluency, which is sufficient for shorter code segments for an adequate metric. The improved BLEU considers the matching of shorter grams as the only scoring factor when the hypothesis is short, matching effectively for more extended codes.

The Keys Match

Compared to NL, OCL consists of predefined literals in code that are more informative, indicating the underlying semantics and syntax. Therefore, we introduce the Keys Match, which measures the similarity of codes by calculating the matching of feature keywords within the code.

In OCL, Keys consist of three categories, including reserved words (i.e., implies), internal attributes (e.g., oclIsTypeOf), and standard library functions (e.g., notEmpty). The Keys Match computes the score S_K by calculating the ratio of matched Keys, detailed by the following equation: 10 ${S}_{K}=100\ast \frac{{\sum }_{i=1}^{3}{w}_{i}{k}_{i}}{{\sum }_{i=1}^{3}{w}_{i}{v}_{i}}$ 11 ${k}_{i}=\frac{\min \left(coun{t}_{C}(i),coun{t}_{R}(i)\right)}{coun{t}_{R}(i)}$ where the k_i denotes ratio of the corresponding Keys occurring in the hypothesis to the reference. Same as the recall rate, it does not include excessive Keys of the generated code. The S_K denotes the score for the Keys Match, a weighted combination of the matching scores for Keys in different categories. The w_i denotes the weight for the Keys of the corresponding category, while the v_i means the validity of the corresponding Keys, a binary number determined by whether it occurs in the reference.

The significance of Keys varies from the category. For example, literals that indicate logical relations, control flow and statement types are reserved words with little possibility of expressing the same or even similar semantics through alternatives, whereas Keys that express class properties, inheritance relations, etc. are built-in properties and have greater substitutability; the standard library functions are intermediate between the two and have potential for synonymous substitution. Based on such observations, we weigh the different kinds of Keys according to the types, contents and weights shown in Table 1.

AST match

In addition to matching textually, we also measure both the structure and the syntax of the code in AST Match. In contrast to natural languages, OCL has restricted syntax, which enables it to be converted into an equivalent abstract syntax tree (AST) by the compiler. We parse OCL statements to obtain the required AST. In the AST, each root node represents a non-terminal symbol, that is, a syntactic representation in the code, and the leaf nodes are their corresponding literals.

To evaluate the similarity of AST, we perform a tree structure evaluation, N-depth match, inspired by the n-grams match in a text metric. In the n-depth match for AST, we match subtrees at 1-depth and 2-depth, respectively. For each root node, we treat it as a sub-tree of height 1, matching its predefined type. For each 2-depth subtree, we compare (1) the type of each node and (2) the number of sons of the root node. The syntactic similarity is matched by 1-depth, and the structural similarity is matched by 2-depth as follows: 12 ${m}_{i}=\frac{Coun{t}_{\mathit{matched}}\left({T}_{\mathit{Ref}},{T}_{\mathit{Can}}\right)}{Coun{t}_{\mathit{all}}\left({T}_{\mathit{Ref}}\right)}$ 13 ${S}_{A}={\left({m}_{1}{m}_{2}\right)}^{\delta }$ where T_Ref and T_Can denote for the AST of the reference snippet and candidate snippet, respectively. The m_i denotes the ratio of matched sub-trees for i-depth trees. In this paper, the δ takes $\frac{1}{2}$ , making the S_A the geometric mean of m₁ and m₂.

OCL generation and recommendation

To make the DeepOCL more efficient for the programmer, the output OCL statements are ranked according to the correctness and matching after generation. In this phase, the DeepOCL generates several OCL statements for the input with the generator, which are re-ranked and selected by the selector.

Top-P sampling

The generator uses the top-p sampling [42] to generate the most relevant sequences for the predictions. In particular, the most miniature set of words will be chosen, whose cumulative probability exceeds the probability p, and the probability is then redistributed to them while the others are zeroing out. We tried several values for p (0.9, 0.95) and picked 0.9, since it performed best in our experiment. Another decoding strategy, beam search, was applied in the generator. With much more time consumed, it showed little efficiency compared to the sampling. The details are performed in Section 5.

Re-ranking with selector

During the inference phase, the selector evaluates the generated candidate OCL statements. DeepOCL selects the most suitable candidate as the final output based on the scores. Unlike the decoding stage, where the output candidates are scored from a statistical perspective, the selector scores the samples by imitating OCLMetric, intending to rate the output that is more closely with the developer's expectations in terms of syntactic semantics. The selector provides the best OCL statement for the user.

EVALUATION

In this section, we first illustrate the research questions and the motivations. Then, we introduce our dataset and the experiment settings. Finally, we answer the research questions with quantitative and qualitative analyses and summarise the experiments with discussion.

Research questions

To evaluate the proposed DeepOCL, we formulate the following research questions.

RQ1: What are the contributions of DeepOCL's components?

To better generate the required OCL statements, we applied two different techniques to our DeepOCL, that is, (1) weighted loss function for the generator to improve the model more sensitive to the valuable information in the OCL; and (2) the selector to allow the DeepOCL produce more human-friendly results. In this RQ, we set out the ablation study to evaluate the contribution of each component of DeepOCL.

RQ2: What is the impact of different decoding settings?

As a hyper-parameter, the decoding settings directly determine the token selection at inference, resulting in a significant impact on the performance of the model. Hence, we set out this RQ to prove the validity of the chosen decoding settings mentioned in Section 4.

RQ3: How effective is DeepOCL compared to human developers?

DeepOCL is proposed to address the dilemmas of OCL so as to support development better. As human developers accomplish the actual development work, we set this RQ to investigate the effectiveness of DeepOCL compared to human developers.

Dataset

There is no existing publicly available dataset of NL-OCL statements, obscuring the model's training. For this reason, we proposed OCLPairs, a sentence-level aligned corpus of NL-OCL, for training and evaluating DeepOCL. The following sections introduce the source of data and the process strategy.

Source and features

OCLPairs currently has a total of 1308 samples, collected from the following three sources:

OMG documents

The Object Management Group (OMG) is an organisation that established standards for modelling. We collected 861 NL-OCL sample pairs from 15 different documents in this paper. The corpus from the source has the following features: (1) Highly specialised words: Words in the software engineering area appear with a high frequency, for example, abstract, multiplicity. (2) Strong contextual dependency: Apart from the corresponding NL statements, there are contextual statements, system descriptions and class diagrams related to the OCL statements, which are invisible in the corpus. (3) Loud noise in NL: Descriptions in NL are not simple to understand, and a few samples have biases, omissions and errors.

Education websites and thesis

243 pairs were collected from the XMLdation 3

and the thesis [7]. As educational examples, featuring in (1) Colloquial expressions: The vocabulary and expressions are more colloquial and comprehensible. (2) Simple grammatical structure: Fewer long sentences, nested sentences and sentences with complex logical relationships appear in this section.

Case study in RM2PT

RM2PT [43] is a case tool for automatic prototype generation from a requirement model by formal contracts. With OCL statements input, the RM2PT can automatically generate the corresponding comments. 455 samples were collected from the four example cases 4

currently available. This part of the sample is characterised by the following points: (1) Regulated, noise-free: Generated by the tool, each pair is correct and strictly aligned. (2) Coverage at different levels: This part contains both short statements with simple expressions and longer complex sentences with more constraints, covering a wide range at the syntax and semantic levels.

Processing strategy

The data from OMG documents and websites and thesis were collected manually, since there is no restricted interface for them. When collected as much as possible from the raw data, we filtered out the noise and inaccurate samples. The data from RM2PT was collected automatically with the script. After the initial selection of the collection, the OCLPairs consist of 1308 samples.

When further refining the data, we attach the following strategy: (1) De-duplicate: For simple duplicate data items, we deleted the others so that only one item remained; for highly similar entries, we deleted most of them. (2) Merge: For samples with the same semantic meaning but different expressions, we merged the features, for example, replace the ‘<’, ‘<<’, and ‘≪’ with ‘«’. (3) Modify: For ambiguous or incorrect items, revise the recoverable ones and delete the rest. The same strategy is applied for data items with excessive context dependencies.

Experimental setup

We use the pre-trained model, codet5-base 5

, proposed on the Huggingface library 6

, as the original model in DeepOCL. We set the peak learning rate to 5e–5 with a linear decay of weight 1e–2. We employ the batch size of 24 and a maximum sequence length of 1024. Our training and experiments are conducted on a server with an NVIDIA A40 GPU, an AMD CPU EPYC 7543 with 32 core processors, and 80G RAM. Our datasets are publicly available at Github 7

Unless specifically stated, DeepOCL in the experiments used the following training strategy: We first train the original model on the generative task for 50 epochs, and the generator was obtained from the naive model trained by the second generative task with the weighted loss for another 100 epochs. We generate 3 synonymous OCL statements for each input NL as the corpus for the evaluation task. To maintain the balance of the data, the samples that received scores above 92 were randomly sampled. The selector is trained on the generated corpus for 100 epochs. Because of the arbitrary nature of the results, we evaluated the performance five times and took the best. To evaluate DeepOCL, we had 2 developers in our study. The human developers were all graduate students with over 2 years experience in constraints development with OCL.

Evaluation metrics

We use three evaluation metrics that are common for the automatic evaluation of code generation, including Bilingual Evaluation Understudy (BLEU) [10], Recall-Oriented Understudy for Gisting Evaluation (ROUGE-L) [11] and accuracy (Acc).

BLEU measures the accuracy rate of the reference and the generated sentence n-grams. By combining the matching and length penalties for different n-grams, the BLEU score can reflect the accuracy and fluency of the text translation. The detail of BLEU is computed as Equation (5). In this paper, we measure the matching of 1 to 4-g to calculate the final score as follows: 14 $BLEU=BP\ast \mathrm{exp}\left(\frac{1}{K}\sum\limits _{n=1}^{K}\mathrm{ln}\,{p}_{n}\right)$ where BP denotes the brevity penalty, p_n denotes the individual matching score for n-gram and K is the maximum length for the grams.

ROUGE reflects the recall rate of the same n-grams between the sentence. As an implementation of ROUGE-N, ROUGE-L evaluates the sequence of the matching words with the longest common sub-sequence algorithm. It considers word-level accuracy and structure similarity as it is less restrictive on sentence lexical continuity. The ROUGE-L is computed as follows: 15 $ROUGE\mbox{-}L=\frac{\left(1+{\beta }^{2}\right)\ast {R}_{\mathit{lcs}}\ast {P}_{\mathit{lcs}}}{{R}_{\mathit{lcs}}+{\beta }^{2}\ast {P}_{\mathit{lcs}}}$ where ROUGE − L is the evaluated score, R_lcs and P_lcs denote recall and accuracy of the longest common sub-sequence between the reference and the generated sentence, respectively.

Acc measures the proportion of samples that exactly matches the character level. In this paper, two sentences are matched only if they are entirely the same, ignoring the spaces and the line breaks. The Acc is computed as follows: 16 $Acc=\frac{Coun{t}_{\mathit{matched}}\left({S}_{\mathit{Ref}},{S}_{\mathit{Can}}\right)}{Coun{t}_{\mathit{all}}\left({S}_{\mathit{Ref}}\right)}$ where Acc is the ratio of matched sentences to all generated sentences.

Experimental results

RQ1: What are the contributions of DeepOCL components?

To evaluate the contribution of each component of DeepOCL, we set out the ablation experiment for the weighted loss function (WL) and the selector (S). We analyse the effect of ablation on the fluctuation of scores to demonstrate the effectiveness of each component. In the experiment, we randomly evaluated 50 samples from the test set of OCLPairs. We kept the settings unchanged except for the components that needed to be analysed in each experiment. In the experiments for the weighted loss function, we explored the result of using an excess weight in addition to the typical ablation to demonstrate impact of it. In the experiments, the weight of DeepOCL was taken as σ = 3. As a comparison, the excess weight was taken to be σ = 7.

Result

Table 2 shows the results for our DeepOCL with different ablation. Both components contribute positively to DeepOCL. For the metric of token level BLEU and ROUGE-L, the weighted loss dominates the contribution, achieving an improvement by 2.3% on BLEU and 3.8% on ROUGE-L, respectively, with a minor contribution of 2.1% for Acc compared to the selector. However, it is clear that when σ is taken too high, the performance of DeepOCL decreases in contrast. The selector mainly contributes to the Acc on DeepOCL, achieving an improvement of 7.4%.

TABLE 2 Ablation study on DeepOCL with/without weighted loss and selector.

Approach	BLEU (%)	Acc (%)	ROUGE-L (%)
CodeT5 (baseline)	54.77	31.78	56.72
DeepOCL w/o WL	55.37	33.78	57.63
DeepOCL w/excess WL	55.12	33.41	57.14
DeepOCL w/o S	56.13	32.12	58.96
DeepOCL	56.69	34.49	59.84

We provide an example of qualitative analysis in Figure 4. It is clear that DeepOCL generates the closest OCL statement to the reference, while lacking S or WL hurts the quality of the generation. The critical information in the natural language is that the context of the statement should be the AssertedEvidence, so the self should imply the AssertedEvidence. Besides, the source implies the source instance, so there should be the forall, the operation for the collections. The DeepOCL w/o the selector fails to catch these points. The DeepOCL w/o the Weighted Loss mistakenly generates oclIsKindof instead of oclIsTypeof, which incorrectly expanded constraints (super-type is also acceptable). In the comparison, the DeepOCL with the excess Weighted Loss can correctly generate the oclIsTypeOf, but cannot correctly generate the entire statement.

[IMAGE OMITTED. SEE PDF]

RQ2: What is the impact of different decoding settings?

To demonstrate the effect of different decoding settings, we conduct experiments on sampling decoding (Top P [42] and Top K [44]) and beam-search decoding. We analyse the effect of decoding settings on the fluctuation of scores so as to demonstrate its effectiveness. In the experiment, the model generates 5 statements under different settings, from which the selector chooses the best candidate as the final result. We randomly evaluated 50 samples from the test set of OCLPairs. We kept the settings unchanged except for the decoding settings in each experiment.

Result

Table 3 shows the results of our DeepOCL with 3 decoding strategies. For the beam search decoding, the results are based on three beam widths, B = {5, 7, 10}, where B is the number of hypotheses when searching for the candidate sequences. For the sampling decoding, the results are based on two thresholds, P = {0.9, 0.95} and K = {20, 50}, where the words whose cumulative probability exceeds the probability P, or the K most likely following words are filtered, among which the probability mass is redistributed.

TABLE 3 Experiments on decoding settings.

Strategy	Parameter	BLEU (%)	Acc (%)	ROUGE-L (%)
Beam search	Beam = 5	49.87	24.61	55.20
Beam = 7	50.12	23.08	55.30
Beam = 10	49.30	23.85	54.73
Top-P sampling [42]	p = 0.95	55.47	34.38	58.88
p = 0.9	56.69	34.49	59.84
Top-K sampling [44]	K = 20	54.70	32.31	58.49
K = 50	54.12	32.88	58.29

Compared to the beam search decoding, our DeepOCL shows a significant advantage under each metric, when sampling decoding is applied. For the beam search decoding, the model's performance does not increase as the beam increases, despite its higher demands on computing power. Even more, the performance decrease in aspects as the beam is set for more, which should give a better result. Table 3 shows that the top-P sampling is better than top-K sampling on average. DeepCCL achieves the best performance under the top p sampling when P is set to p = 0.9, which can be attributed to the fact that arbitrariness can enhance the model to generate diverse outputs for the selector to pick up.

RQ3: How effective is DeepOCL compared to human developers?

To compare the efficiency of DeepOCL with that of human developers, we conduct experiments to translate the informal requirements into corresponding OCL statements by DeepOCL and the human developers, respectively. We compare and analyse the outcome from different approaches to demonstrate the effectiveness. In the experiment, we employed two experienced developers as the subjects who were required to write 12 OCL statements from requirements in the natural language. The results are evaluated with the three mentioned metrics. We use the best performance settings for DeepOCL and average the scores from two subjects as the final result.

Result

Table 4 shows the results of DeepOCL and human subjects. DeepOCL achieves considerable improvements on each metric compared to human developers with 35.19% on BLEU and 19.52% on ROUGE-L, respectively. DeepOCL performs most advantage on the Acc, with 99.94% better than humans, which proves that DeepOCL can better generate precisely the same output as the reference.

TABLE 4 Experiments on DeepOCL and human.

Approach	ROUGE-L (%)	Acc (%)	BLEU (%)
Human	61.13	16.67	54.96
DeepOCL	73.06	33.33	74.30
Improvements	19.52%	99.94%	35.19%

We further analysed the samples to clarify the advantages of DeepOCL. In contrast to humans, DeepOCL seldom makes low-level mistakes, for example, a subject incorrectly wrote item.oclIsUndefined = false for Item doesn't exist, which is commonly found for human developers and hard to avoid. Besides, DeepOCL performs better on inference. For some samples, the correct OCL statement contains information that is not available in natural language requirements, for example, attributes or relationships, which is difficult for human developers to infer and thus obtain the correct statement. After training on a specific corpus, DeepOCL can implement inference on inputs within a specific domain to reveal the implicit semantics leading to correct results.

Discussion and limitation

One of the main advantages of DeepOCL is that it can select more accurate statements without a reference from a human perception, which allows it to give more accurate predictions and inspirations. The ablation study also proved it. Nevertheless, as a simple textual improvement, the weighted loss offers more limited performance improvements to DeepOCL, while the weights need to be chosen appropriately.

Compared to existing contracts written entirely based on informal requirements manually, DeepOCL can give developers good aid without a domain model, gaining significant improvements in accuracy and efficiency. Considering that DeepOCL can handle NL and OCL in unrestricted formats, it is suggested that DeepOCL provides a better response to the existing challenges of OCL applications.

However, while the selector contributes to DeepOCL dominantly by selecting more accurate statements from the human perception without a reference, it could rate the statements that are entirely correct with incredibly low scores on occasion, which is hard to explain or prevent. Although it enhances the performance of DeepOCL on average, it could lead to worse results for random natural language requirement inputs. Besides, in comparison with existing approaches, DeepOCL requires the developer to align the OCL statement with the element in the domain model, which could be problematic in some scenarios.

THREATS TO VALIDITY

The hyperparameters of the model and training process have a strong impact on the performance of the deep learning-based approach. However, we did not adjust the hyperparameters of DeepOCL experimentally. Therefore, better results may be possible with other settings. Moreover, the random samples taken for the experiment and the instability of the generated results may also have shown some bias, which caused slight fluctuations in the result. There were also limited σ-values for validation when tested in the ablation experiments, which need further investigation. In our experiments for RQ3, we evaluated the results through three metrics. The evaluation in terms of literal quantities is only partially reflective of the accuracy and correctness of the results. Besides, human subjects may also introduce the subjective error. The dataset OCLPairs also poses a threat to our evaluation. Though initially filtered and processed, data from the Internet may still contain erroneous constraints.

CONCLUSION AND FUTURE WORK

In this paper, we present DeepOCL, a deep learning-based approach that enables NL-OCL generation. Compared to traditional approaches, DeepOCL does not require a domain model and can select the most relevant OCL statements from the generated results without ground truth. Trained on the comprehensive corpus, OCLPairs, DeepOCL can convert unconstrained requirements in the natural language to the desired OCL statements, offering vital assistance for developers. Experiments demonstrate that DeepOCL can effectively generate the required OCL statements, which can be adopted for further addressing the challenges of OCL writing.

We believe that DeepOCL can contribute to the automation of informal requirements to formal contracts by improving efficiency and accuracy in the development of the formal specification. In future work, we plan to continue to improve DeepOCL to reduce its reliance on data, cope with the lack of a corpus and improve its performance.

ACKNOWLEDGEMENTS

The work was supported by the National Key Research and Development Program of China (No. 2021YFB2501301).

CONFLICT OF INTEREST STATEMENT

The authors declare no conflicts of interest.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are openly available in RM2PT/DeepOCL at .

References

Bois, P.D., Dubois, E., Zeippen, J.‐M.: On the use of a formal requirements engineering language: the generalized railroad crossing problem. Requir. Eng. 2(4), 171–183 (1997). [DOI: https://dx.doi.org/10.1007/bf02745370]

Holtzman, A., et al.: The curious case of neural text degeneration. In: International Conference on Learning Representations (2019)

Yang, Y., et al.: RM2PT: a tool for automated prototype generation from requirements model. In: 2019 IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE‐Companion), pp. 59–62. IEEE (2019)

Fan, A., Lewis, M., Dauphin, Y.: Hierarchical neural story generation. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 889–898 (2018). Long Papers

Word count: 7857

Show less

© 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

DeepOCL: A deep neural network for Object Constraint Language generation from unrestricted nature language

Content area

Abstract

Full text