NATURAL LANGUAGE PROCESSING FOR DETECTING FORWARD REFERENCE IN A DOCUMENT

Abstract-Meyer's seven sins are recognized as types of mistakes that requirements specialists often make when specifying requirements. Such mistakes play a significant role in driving a project to failure. Many researchers have focused on the ambiguity and contradiction types of mistakes, while the other types have received less attention. Those mistakes nevertheless occur often in practice and may be as costly as the first two. This paper introduces an approach to detect forward reference. It traverses a requirements document and extracts and processes each statement. During statement extraction, any terms that reside in the statement are also extracted. Based on rules that use part-of-speech (POS) patterns, each statement is classified as a term definition or not. For each term definition, the term is added to a list of defined terms. At the same time, every time a new term is found in a statement, it is checked against the list of defined terms. If it is not found, the requirements statement is classified as a statement with a forward reference. Experimentation on 30 requirements documents from various software project domains shows that the approach has almost perfect agreement with a domain expert in detecting forward reference, with a kappa index value of 0.83.

Keywords-Forward Reference; Natural Language Processing; Term

Abstract (translated from the Indonesian version)-Meyer's seven sins are known as types of mistakes that systems analysts often make when specifying requirements. These mistakes play a major role in causing a project to fail. Many researchers have focused on the ambiguity and contradiction types of mistakes. The other types have received less attention, even though in practice their financial impact is comparable to that of the first two types. This article describes an approach to detect forward reference. The approach extracts and processes each statement in a requirements document. During extraction, every term found is also extracted. Based on certain rules that use POS patterns, a statement is classified as a term definition or not. For each such definition, a term is added to the list of defined terms. At the same time, each time a new term is found in a statement, the approach checks whether its definition exists. If it does not, the statement is classified as a statement containing a forward reference. Test results on 30 requirements documents from various software project domains show that the approach is nearly as reliable as an expert in detecting forward reference, with a kappa value of 0.83.

Keywords (translated from the Indonesian version)-Forward Reference, Term, Natural Language Processing

I. INTRODUCTION

Requirements specification, as part of requirements engineering, mainly deals with how to express requirements in a specific, measurable, realizable, attainable, and time-bound manner. A requirements specification should be agreed upon by all stakeholders. It is concerned with the processes of eliciting, analyzing, and validating or verifying requirements. These processes are documented for the most part in natural language. The Software Requirements Specification (SRS) is one of the deliverables produced iteratively throughout the software development lifecycle, and it is one of the most important artefacts produced during this phase of software development. The quality of the SRS document determines whether a software project ends up as a success story or as just another project failure. It is the entry point to, and provides input for, the design, coding, and testing phases. The Standish Group's 2009 report on software project chaos indicates that 31.1% of software project failures are rooted in requirements specification. Considerable resources, in terms of man-hours, are therefore spent to ensure the quality of the SRS document. This is because real-life SRS documents may contain a considerable number of pages, sentences, figures, and tables.

During requirements specification, engineers focus on specifying requirements, which in most cases are written in natural language. Requirements specification therefore inherits the subjectivity of natural language. This often leads to common mistakes made by engineers when specifying requirements. These mistakes are known as Meyer's seven sins [1]: seven common mistakes that are often found in requirements documents, i.e. noise, silence, over-specification, contradiction, ambiguity, forward reference, and wishful thinking.

Researchers have been working on identifying and dealing with such mistakes for the last two decades. Ambiguity has received the most attention [2]-[6]. Researchers from Stanford [7] have worked on detecting contradictions between texts. Nevertheless, less attention has so far been given to the other types of mistakes, even though they are all equally important.

This work focuses on creating an approach to detect forward reference in a requirements specification document. A forward reference is a first appearance of a term in a passage that precedes the term's definition. To our knowledge, no previous work has focused on forward reference detection. Our approach uses a natural language processing (NLP) library to capture terms within a document and to determine whether a statement contains a definition of a term. We developed a set of rules that process the sentence metadata generated by natural language processing in order to extract terms and identify definitions.

II. FORWARD REFERENCE

Reference [8] defines forward reference as the state of an element in a document that refers to a feature of the solution domain before that feature is defined. This suggests that forward reference in a requirements specification document is a first appearance of a term in a passage that precedes its definition. Consider one of the problem descriptions from ACM's OOPSLA DesignFest® online source (http://designfest.acm.org/), shown in Figure 1.

We can see that the sentence in line 4 contains the term "case worker". This term refers to a role in the respective solution domain. At the point where it is first referred to, the term "case worker" has not yet been described or defined. Its description appears later, in line 15. It can be concluded that the sentence in line 4 contains a forward reference.

The goal of this research is to provide an approach that assists requirements engineers in producing a high-quality requirements document that is free of forward references. The approach is designed to identify occurrences of forward reference in a software requirements specification document.

III. FORWARD REFERENCE DETECTION

The approach consists of a number of processes, as shown in Figure 2. First, the requirements specification document is processed by an element extractor module to extract relevant elements. Second, a natural language processing module processes each extracted element to generate its metadata, such as part of speech, sentence structure, and word dependency. Third, given the metadata, a term identifier module identifies any term that resides in an element. Fourth, using the same metadata, a definition identifier module classifies whether an element is a definition and identifies which specific term the element defines. Finally, a pigeon-hole module directs each term found by the term identifier module to either a list of defined terms or a list of undefined terms.

A. Element Extractor

A document is composed of one or more sets of elements. Each set of elements has a certain type. In a software requirements specification document, the type of an element may be one of the following: title, section, subsection, paragraph, table, figure, sub-title, table header, cell, header, footer, and page number. A number of document element types are not considered in forward reference detection. Title, section, and subsection are examples of document elements that are often expressed as a term or contain one or more terms. The terms in these document elements are not relevant to forward reference. Our approach considers only paragraph, sub-title, and cell elements for the forward reference detection process. These elements represent the description of the software requirements of the solution domain. Aside from these three elements, a figure also describes software requirements. Nevertheless, our current approach does not consider any term that resides in a figure. Such terms should be treated similarly to a cell element in a table, but this would require adding a module that extracts text elements from graphical components.

A paragraph element is a type of element that contains a set of sentence elements. A sentence element is a set of words that compose a sentence. In a requirements document, a sentence usually takes the form of a statement. One of the sentence elements should be the main idea of the respective paragraph. Each paragraph element is decomposed into sentence elements. A sub-title element is a type of element that indicates what a figure describes. The following is an example of a sub-title.

"Fig. 1 Architecture design of rotary lock."

We can see that the phrase marked in bold represents the term being referred to in the respective element. The sub-title contains a term but does not contain a term definition. A cell element is text that resides in a cell of a table. This may apply to any document element. The element extractor is designed to extract each relevant element in a document together with its element type. The process ignores irrelevant elements, such as titles and sections. These relevant elements are then fed sequentially to the next process, i.e. the Natural Language Processor.
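The filtering step described in this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Element` class, the `split_sentences` helper, and its naive splitting rule are hypothetical and are not part of the authors' tool, which may decompose paragraphs differently.

```python
import re
from dataclasses import dataclass

# Element types that the approach keeps for forward reference detection.
RELEVANT_TYPES = {"paragraph", "sub-title", "cell"}


@dataclass
class Element:
    kind: str  # e.g. "title", "section", "paragraph", "sub-title", "cell"
    text: str


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def extract_relevant(elements):
    """Yield (kind, text) pairs in document order, skipping irrelevant types.

    Paragraphs are decomposed into sentence elements; sub-titles and cells
    are passed through unchanged.
    """
    for element in elements:
        if element.kind not in RELEVANT_TYPES:
            continue
        if element.kind == "paragraph":
            for sentence in split_sentences(element.text):
                yield ("sentence", sentence)
        else:
            yield (element.kind, element.text)
```

Keeping the output in document order matters, because the later modules decide whether a term is defined by looking only at the elements that came before it.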

B. Natural Language Processor

Each relevant element is processed by a natural language processor (NLP). This module traverses the list of elements and generates metadata for each element. It uses OpenNLP to produce part-of-speech (POS) tags, terms, and word dependencies. For example, consider the following document element ei.

ei: "The program's input is a stream of characters whose end is signaled with a special end of text character, ET." (source: http://www.designfest.org)

Document element ei is a sentence element. The NLP uses the en-pos-maxent model to generate POS tags for ei. The following are the POS tags generated for document element ei.

The/DT program/NN 's/POS input/NN is/VBZ a/DT stream/NN of/IN characters/NNS whose/WP$ end/NN is/VBZ signaled/VBN with/IN a/DT special/JJ end/NN of/IN text/NN character/NN ,/, ET/NNP ./.
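Downstream modules need this tagger output as structured data rather than as a string. The following Python sketch shows one way to read the `word/TAG` notation used above into (word, tag) pairs; the function name `parse_tagged` is hypothetical, and the sketch does not call OpenNLP itself.

```python
def parse_tagged(tagged):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs.

    rpartition splits at the last slash, so punctuation tokens such as
    ',/,' and './.' are handled the same way as ordinary words.
    """
    pairs = []
    for token in tagged.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


# The tagged form of document element ei, copied from the text above.
TAGGED_EI = (
    "The/DT program/NN 's/POS input/NN is/VBZ a/DT stream/NN of/IN "
    "characters/NNS whose/WP$ end/NN is/VBZ signaled/VBN with/IN a/DT "
    "special/JJ end/NN of/IN text/NN character/NN ,/, ET/NNP ./."
)

pairs = parse_tagged(TAGGED_EI)
```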

C. Term Identifier

The term identifier chunks the given tagged sentence into a set of tagged phrases. The following is part of the chunking result for ei.

[NP The program's input/NNP] [VP is/VBZ] [NP a/DT stream/NN of/IN characters/NNS] [NP whose/WP$] [NP end/NN] [VP is/VBZ signaled/VBN] [PP with/IN] [NP a/DT special/JJ end/NN of/IN text/NN character/NN] [NP ET/NNP] [./.]

Each chunk is a candidate term. As already mentioned, this work considers only chunks with the NP tag. Therefore, given ei, the chunking process returns the following terms (after removing any determiner or cardinal number): program's input, stream of characters, special end of text character, and ET. Finally, the NLP removes any commonly known terms using Wikipedia. This last step removes two of the terms (stream of characters and special end of text character) and leaves two terms as a result, i.e. program's input and ET.
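A minimal Python sketch of this step is given below, assuming the NP chunks are already available as lists of (word, tag) pairs. The names are hypothetical, and the small `COMMON_TERMS` set is only a stand-in for the Wikipedia lookup, whose actual mechanism the text does not describe. Stripping the `WP$` tag and filtering the single-word chunk "end" through the same set are also assumptions of the sketch.

```python
# Tags dropped from a chunk: determiners and cardinal numbers (from the text),
# plus possessive wh-pronouns (an assumption, to discard the chunk "whose").
STRIP_TAGS = {"DT", "CD", "WP$"}


def chunk_to_term(chunk):
    """Join the remaining words of an NP chunk, re-attaching the possessive 's."""
    words = [word for word, tag in chunk if tag not in STRIP_TAGS]
    return " ".join(words).replace(" 's", "'s")


def identify_terms(np_chunks, common_terms):
    """Return candidate terms in order of appearance, without common terms."""
    terms = []
    for chunk in np_chunks:
        term = chunk_to_term(chunk)
        if term and term.lower() not in common_terms and term not in terms:
            terms.append(term)
    return terms


# NP chunks of document element ei, following the chunking result above.
NP_CHUNKS_EI = [
    [("The", "DT"), ("program", "NN"), ("'s", "POS"), ("input", "NN")],
    [("a", "DT"), ("stream", "NN"), ("of", "IN"), ("characters", "NNS")],
    [("whose", "WP$")],
    [("end", "NN")],
    [("a", "DT"), ("special", "JJ"), ("end", "NN"), ("of", "IN"),
     ("text", "NN"), ("character", "NN")],
    [("ET", "NNP")],
]

# Hypothetical stand-in for the Wikipedia-based check of commonly known terms.
COMMON_TERMS = {"end", "stream of characters", "special end of text character"}

terms = identify_terms(NP_CHUNKS_EI, COMMON_TERMS)
```

With these inputs, `terms` holds the two terms that the text reports for ei, namely program's input and ET.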

D. Definition Identifier

Like the term identifier, the definition identifier consumes the tagged sentence produced by the NLP. In parallel with the term identifier, it identifies any definition of a term that resides in a document element and decides whether the document element contains a definition of a term. A definition is a clause that explains, formulates, or describes a term. The process determines whether a clause is a definition based on a set of rules. A rule is a pattern that comprises a word dependency tree with its POS tags. The patterns were generated by analyzing a corpus of term definition sentences. We managed to generate 7 patterns for a sentence that contains a term definition. The following is the list of rules to identify a term definition.

NP(NN | NNP) VBZ + VBN + IN

NP(NN | NNP) VBZ + DT + NN

NP(NN | NNP) VBZ

NP(NN | NNP) VBZ + IN

NP(NN | NNP) VBZ + DT + NN + IN + WHNP

VP(VBZ + VBN + IN + NP(NN | NNP))

For example, given the document element ei, we can see its sentence structure in Figure 3. We can determine that ei matches the rule NP(NN) VBZ + DT + NN. It can therefore be concluded that document element ei contains a term definition, where the term is the NP subtree ("The program's input").
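The rule matching can be approximated over a flat tag sequence, as in the Python sketch below. This is a simplification under stated assumptions: the paper's rules are patterns over a dependency tree, whereas the sketch only compares the tags that follow a sentence-initial noun phrase. It tries longer patterns first, omits the VP-initial rule, and treats the tags `WDT`, `WP`, and `WP$` as `WHNP`. All names are hypothetical.

```python
NP_TAGS = {"DT", "CD", "JJ", "NN", "NNS", "NNP", "POS"}
WH_TAGS = {"WDT", "WP", "WP$"}  # assumption: these tags stand in for WHNP

# Tag sequences that must follow a subject NP whose head is NN or NNP.
# Longer tails come first, because the bare VBZ rule is a prefix of the others.
RULE_TAILS = [
    ("VBZ", "DT", "NN", "IN", "WHNP"),
    ("VBZ", "VBN", "IN"),
    ("VBZ", "DT", "NN"),
    ("VBZ", "IN"),
    ("VBZ",),
]


def _tag_matches(expected, actual):
    if expected == "WHNP":
        return actual in WH_TAGS
    return expected == actual


def find_definition(pairs):
    """Return (term, matched_tail) if the tagged sentence looks like a
    term definition under the flat rules above, otherwise None."""
    i = 0
    while i < len(pairs) and pairs[i][1] in NP_TAGS:
        i += 1
    # The subject NP must exist and its last word must be tagged NN or NNP.
    if i == 0 or pairs[i - 1][1] not in {"NN", "NNP"}:
        return None
    rest = [tag for _, tag in pairs[i:]]
    for tail in RULE_TAILS:
        if len(rest) >= len(tail) and all(
            _tag_matches(expected, actual) for expected, actual in zip(tail, rest)
        ):
            words = [w for w, t in pairs[:i] if t not in {"DT", "CD"}]
            return " ".join(words).replace(" 's", "'s"), tail
    return None


# Opening tokens of document element ei.
EI_PREFIX = [
    ("The", "DT"), ("program", "NN"), ("'s", "POS"), ("input", "NN"),
    ("is", "VBZ"), ("a", "DT"), ("stream", "NN"), ("of", "IN"),
    ("characters", "NNS"), ("whose", "WP$"),
]
match = find_definition(EI_PREFIX)
```

For the opening of ei, the sketch returns the term program's input together with the tail VBZ, DT, NN, which corresponds to the rule named in the text. The determiner is stripped here so that the term has the same form as the one produced by the term identifier.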

E. Pigeon Hole

Both previous modules provide input for the pigeon hole process. First, if a given sentence element contains a term definition, the process adds the respective term to the list of defined terms if and only if it is not already listed there. Second, for each term found in a document element, it marks the element as forward referencing if and only if the term is not in the list of defined terms.
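The two steps can be sketched in Python as a single pass over the elements in document order. The function name and the input format, a tuple of element identifier, terms found, and the defined term if any, are assumptions made for illustration only.

```python
def detect_forward_references(elements):
    """Run the pigeon hole step over elements given in document order.

    Each item is (element_id, terms, defined_term), where defined_term is
    None when the element contains no term definition.
    Returns (defined_terms, ids_of_forward_referencing_elements).
    """
    defined = []
    flagged = []
    for element_id, terms, defined_term in elements:
        # Step 1: record the definition, but only if it is not listed yet.
        if defined_term is not None and defined_term not in defined:
            defined.append(defined_term)
        # Step 2: flag the element if any of its terms is still undefined.
        if any(term not in defined for term in terms):
            flagged.append(element_id)
    return defined, flagged


# Illustrative input: element 1 mirrors ei; element 2 is a hypothetical
# later sentence that defines ET.
sample = [
    (1, ["program's input", "ET"], "program's input"),
    (2, ["ET"], "ET"),
]
defined_terms, forward_referencing = detect_forward_references(sample)
```

Because the definition is recorded before the terms are checked, an element that defines a term does not count as a forward reference to that same term.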

For example, assume a list of defined terms dt = {program's input}. Given the document element ei, the approach determines that ei is forward referencing. This is because the term "ET" does not exist in dt, which means that it

IV. DISCUSSION

For experimentation purposes, this research collected 30 requirements documents from various sources. They belong to different kinds of projects, such as student projects, web-based applications, information systems, eBill, games, and embedded systems. A non-IT person with an academic background in linguistics was asked to identify document elements that contain terms that are forward referenced (FR) or predefined (PR). The kappa statistic [9] was chosen to measure the performance of the proposed approach, because it measures how reliably the approach performs compared with an expert, in this context a person with a non-IT background. Table 1 shows the results of our experimentation. The overall kappa value for the 30 documents is 0.828. Based on [10], this can be interpreted as almost perfect agreement between the proposed method and the expert in determining which terms are forward references and which are predefined.
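For readers unfamiliar with the statistic, Cohen's kappa [9] compares the observed agreement between two raters with the agreement expected by chance. The Python sketch below computes it for two label sequences. The label lists are invented purely for illustration and are not taken from Table 1.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: share of items on which both raters agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal shares, per category.
    p_expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Invented labels (FR = forward referenced, PR = predefined), illustration only.
approach = ["FR", "FR", "PR", "PR", "FR", "PR", "PR", "PR", "FR", "PR"]
expert = ["FR", "FR", "PR", "PR", "PR", "PR", "PR", "PR", "FR", "PR"]
kappa = cohen_kappa(approach, expert)
```

In the interpretation scale of [10], values above 0.80 are read as almost perfect agreement, which is the band in which the reported 0.828 falls.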

V. CONCLUSION

Requirements engineers are responsible for producing a high-quality requirements specification document. Maintaining the quality of such a document manually takes considerable effort and may consume significant resources of a software development project. Forward reference is one of the factors that reduce that quality. This research aims to provide an approach that uses natural language processing to detect any instance of forward reference within a document. The experiment on 30 requirements documents from various software project domains indicates that the proposed approach has almost perfect agreement with a domain expert in detecting forward reference in software requirements documents, with a kappa index value of 0.83.

REFERENCES

[1] Meyer, B. 1985. On Formalism in Specifications. IEEE Software, 2(1), January 1985, pp. 6-26.

[2] Muliawan, L W. and Siahaan, D.O. 2012. Software Requirements Ambiguity Analysis based on SMART Requirements (Analisis Ambiguitas Kebutuhan Perangkat Lunak Berdasarkan Acuan SMART Requirements). In Manajemen Teknologi Informasi, SEMNAS XIV, Surabaya, Indonesia, 2012.

[3] Hussain, I., Ormandjieva, O., and Kosseim, L. 2007. Automatic Quality Assessment of SRS Text by Means of a Decision-Tree-Based Text Classifier. In Proceedings of the 7th International Conference on Quality Software, Portland, USA, pp. 209-218.

[4] Gnesi, S., Fabbrini, F., Fusani, M., and Trentanni, G. 2005. An automatic tool for the analysis of natural language requirements. International Journal of Computer Systems Science & Engineering, vol. 20(1), pp. 53-62.

[5] Kamsties, E., Berry, D. M., and Paech, B. 2001. Detecting Ambiguities in Requirements Documents Using Inspections. In Proceedings of the First Workshop on Inspection in Software Engineering (WISE'01), pp. 68-80.

[6] Denger, C., Berry, D. M., and Kamsties, E. 2003. Higher quality requirements specifications through natural language patterns. In Proc. of the IEEE Int. Conf. on Software - Sci., Tech. and Eng., pp. 80-91.

[7] Marneffe, M.D., Rafferty, A.N., and Manning, C. D. 2008. Finding contradictions in text. In ACL 2008.

[8] Siahaan, D. 2012. Software Requirements Analysis (Analisa Kebutuhan Dalam Rekayasa Perangkat Lunak). Penerbit Andi.

[9] Cohen, J. 1960. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 20:37-46.

[10] Landis, J.R. and Koch, G.G. 1977. The Measurement of Observer Agreement for Categorical Data. Biometrics, vol. 33(1), pp. 159-174.

AuthorAffiliation

Daniel Siahaan1, Izzatul Umami2

Daniel Siahaan is with the Informatics Department, Institut Teknologi Sepuluh Nopember, Surabaya, 60111, Indonesia. E-mail: [email protected].

Izzatul Umami is with the Informatics Department, University of Darul Ulum Jombang, Jombang, Indonesia. E-mail: [email protected]

Details

Title

NATURAL LANGUAGE PROCESSING FOR DETECTING FORWARD REFERENCE IN A DOCUMENT

Author

Siahaan, Daniel; Umami, Izzatul

Pages

138-142

Publication year

2012

Publication date

Nov 2012

Publisher

IPTEK, The Journal for Technology and Science

ISSN

08534098

e-ISSN

20882033

Source type

Scholarly Journal

Language of publication

English

ProQuest document ID

1436793395
