Keywords: parser, grammar, software analysis
Abstract - This paper presents a revolutionary way to parse computer programming languages without a traditional grammar. The motivation for this approach is to dramatically increase scalability: the goal is to parse and analyze billions of lines of code written in hundreds of programming languages. Achieving that goal calls for sharable, open-source, modular ways of defining the syntax and semantics of programming languages. The new parsing technique replaces a traditional grammar with a computer program, referred to as a Programmar (a portmanteau of program and grammar). All the basic operations in BNF (sequencing, alternation, optional terms, repetition, and grouping) are supported, and the Java code is both sharable and modular. This parsing approach enables dozens or even hundreds of developers to work concurrently on computer program analysis, while avoiding many of the consistency issues encountered when building grammars and associated code analysis tools.
1 Introduction
Businesses around the world collectively maintain billions of lines of production software written in legacy computer languages such as COBOL, RPG, PL/I, Fortran and Natural. These organizations are highly motivated to modernize their software for a number of reasons, including the difficulty of maintaining old, brittle code [1] and of hiring people with legacy skillsets [2]. Unfortunately, the modernization process is often either prohibitively expensive or produces new software of low quality that is difficult to maintain going forward [3]. Available modernization tools (e.g. [4-8]) tend not to scale to large, complex software systems, which can comprise tens of millions of lines of code written in multiple programming languages.
For the past several decades, legacy software analysis tools have been typified by the kind of parser generated by Yet Another Compiler-Compiler (YACC) [9]. Such a parser interprets computer program code according to a context-free grammar, a declarative description of the syntax of a specific programming language. The parsing process relies on a separate tokenizer (typically generated by LEX, the lexical analyzer generator [10]) and produces an Abstract Syntax Tree (AST).
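To make the contrast with a declarative YACC-style grammar concrete, the Programmar idea can be sketched as ordinary Java code in which each BNF operation (sequencing, alternation, optional terms, repetition, grouping) is a plain method returning a composable rule object. This is a minimal, hypothetical illustration; the interface and method names below are invented for this sketch and are not taken from the paper's actual implementation.

```java
// Hypothetical sketch: BNF operations as modular, sharable Java methods
// rather than a declarative grammar file. Names are illustrative only.
public class ProgrammarSketch {
    // A rule consumes input at a position and returns the position after
    // the match, or -1 on failure.
    interface Rule { int match(String input, int pos); }

    // Terminal: match a literal token.
    static Rule lit(String s) {
        return (in, p) -> in.startsWith(s, p) ? p + s.length() : -1;
    }

    // Sequencing: A B
    static Rule seq(Rule... rules) {
        return (in, p) -> {
            int cur = p;
            for (Rule r : rules) {
                cur = r.match(in, cur);
                if (cur < 0) return -1;
            }
            return cur;
        };
    }

    // Alternation: A | B
    static Rule alt(Rule... rules) {
        return (in, p) -> {
            for (Rule r : rules) {
                int next = r.match(in, p);
                if (next >= 0) return next;
            }
            return -1;
        };
    }

    // Optional term: [A] -- succeeds without consuming input if A fails.
    static Rule opt(Rule r) {
        return (in, p) -> {
            int next = r.match(in, p);
            return next >= 0 ? next : p;
        };
    }

    // Repetition: {A} -- zero or more occurrences.
    static Rule rep(Rule r) {
        return (in, p) -> {
            int cur = p;
            for (int next = r.match(in, cur); next > cur; next = r.match(in, cur))
                cur = next;
            return cur;
        };
    }

    public static void main(String[] args) {
        // Grouping falls out naturally: any Rule value can be nested.
        // Example rule: "MOVE" {" "} ("A" | "B") ["."]
        Rule move = seq(lit("MOVE"), rep(lit(" ")),
                        alt(lit("A"), lit("B")), opt(lit(".")));
        System.out.println(move.match("MOVE  A.", 0)); // 8 (whole input matched)
        System.out.println(move.match("COPY A", 0));   // -1 (no match)
    }
}
```

Because each rule is an ordinary Java value, rules for different language constructs can live in separate classes or modules and be shared and composed, which is the property the Programmar approach relies on for many developers working concurrently.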
Modern programming languages also continue to evolve and require robust analysis approaches (e.g. [11-13]). For example, managing deprecated code often requires detailed...




