Content area
The paper presents the design of a training compiler which is developed for the purposes of education in compilers and language processors in computer science courses. The presented compiler has the following main advantages compared to known training compilers used in various universities - a simplified modular structure and the building of an explicit abstract syntactic tree of the input program. The modules in the compiler structure are lexical analyzer, syntactic analyzer, semantic analyzer and code generator. This separation allows students to effectively study the main stages of compilation - lexical analysis, parsing, semantic analysis and code generation. Building and visualizing an explicit abstract syntax tree helps students to understand the translation of the program into the compiler's front-end and make the transition to the compiler's back-end. The compiler translates a program written in a high-level language into virtual machine code. An interpreter to execute the generated virtual machine code is also presented. The presented design is compared to other known training compilers used in various university courses. The input language is procedure-oriented and is a subset of the C and Java languages, which makes it easier for students to use it. Language has enough resources to solve many practical problems. The input program for the compiler is a sequence of definitions of variables and functions. The language of the training compiler is strongly typed. Variables, constants and expressions are related to a specific type. Input-output operations require a certain type of arguments, arithmetic-logical operations are defined for specific types of arguments and type of returned result. At the end of the paper are presented the results of the work of the training compiler in translating a sample input program to code for a virtual machine. The results demonstrate the output of each compiler module - a token stream, an abstract syntax tree, and a set of virtual machine instructions. The structure of the presented training compiler can be used for different input languages in training on compilers and language processors.
Abstract: The paper presents the design of a training compiler which is developed for the purposes of education in compilers and language processors in computer science courses. The presented compiler has the following main advantages compared to known training compilers used in various universities - a simplified modular structure and the building of an explicit abstract syntactic tree of the input program. The modules in the compiler structure are lexical analyzer, syntactic analyzer, semantic analyzer and code generator. This separation allows students to effectively study the main stages of compilation - lexical analysis, parsing, semantic analysis and code generation. Building and visualizing an explicit abstract syntax tree helps students to understand the translation of the program into the compiler's front-end and make the transition to the compiler's back-end. The compiler translates a program written in a high-level language into virtual machine code. An interpreter to execute the generated virtual machine code is also presented. The presented design is compared to other known training compilers used in various university courses. The input language is procedure-oriented and is a subset of the C and Java languages, which makes it easier for students to use it. Language has enough resources to solve many practical problems. The input program for the compiler is a sequence of definitions of variables and functions. The language of the training compiler is strongly typed. Variables, constants and expressions are related to a specific type. Input-output operations require a certain type of arguments, arithmetic-logical operations are defined for specific types of arguments and type of returned result. At the end of the paper are presented the results of the work of the training compiler in translating a sample input program to code for a virtual machine. The results demonstrate the output of each compiler module - a token stream, an abstract syntax tree, and a set of virtual machine instructions. The structure of the presented training compiler can be used for different input languages in training on compilers and language processors.
Keywords: Training Compiler; Lexical Analyzer; Parser; Semantic Analyzer; Code Generator; Formal Grammars and Languages
(ProQuest: ... denotes formulae omited.)
INTRODUCTION
Teaching the fundamentals of compilation of programming languages is a challenging task in the education of students in computer science. The subject introduces students to important areas as compilers, language processors, formal grammars. The area includes essential skills necessary for software and computer engineers. It is a complex area, which could be presented as an intersection of theory and practice in computer science. The experience show that many students have difficulties to understand the fundamentals of compilation and to complete compilers in practice.
The task of the compiler is to translate an input program written in a high-level language into an object program, virtual machine code, assembly language or intermediate language. The principles of operation of the compilers are studied for examples in disciplines called "Language Processors", "Language Processors and Grammars" "Compilers and Interpreters", "Compilers".
The authors have been facing the above mentioned challenges teaching university courses in compilers and language processors. Our purpose is to provide a training compiler to increase the effectiveness of learning compilers. The approach used contains two basic steps:
1. Research of known training compilers used in different universities in order to point out the best teaching and learning practices as well as to detect potential weaknesses and disadvantages.
2. Regarding the summarized conclusions, develop a training compiler to provide effective teaching and learning compilers and language processors in computer science curriculum.
Several training compilers used in different university courses are researched to summarize their strengths and weaknesses.
One of the researched compilers is the one used in the compilers course at Technical University of Varna [8]. The input programming language is a subset of Pascal and is called MiniPascal. The Pascal programming language is no longer taught in high schools or universities, making it completely unknown to students. The structure of Pascal's program differs significantly from the structure of the program in widely used languages such as C, C++, Java and C #. This makes it even more difficult for students to understand the language when using the compiler. Time is wasted learning the input language, which should focus on developing the compiler.
Furthermore, the input language in the learning compiler does not provide subprograms - functions, but only procedures without parameters. This makes it impossible to reuse a subprogram and include it in an expression.
There are problems in implementation as well. The training compiler is divided into modules, each of which uses a common header file. In each subsequent module, this file is extended with new declarations and definitions. For example, existing functions are expanded with new parameters. In each module students have to transfer all changes that have been made so far in the header file of previous modules. On the one hand, this slows down the work on the new modules, and on the other hand, it is a common cause of errors.
The stages of parsing, semantic analysis and code generation are realized in the same program functions. In each following module, they become increasingly difficult to understand, implement, and test. When supplementing the next function with new functionality, errors often occur in already implemented functionality.
Furthermore, the compiler builds an implicit parse tree of an input program. This tree cannot be easily visualized, which makes it very difficult for students to make sense of the difference between input program analysis and output program synthesis. Such an implicit syntax tree cannot be used to synthesize an output program.
In the implementation of the compiler, an inappropriate hierarchy of modules is built, in which each module inheritance the previous one and which does not sufficiently demarcate the logic of the individual stages during compilation. This makes it difficult to develop and test one module independently of the others.
As a result of the listed problems and disadvantages, there is a decreasing interest of students in studying the subject.
Other known tool for training in compilers is the COOL (Classroom Object-Oriented Language) compiler developed and used for education at Stanford University [3, 13]. The COOL language is especially designed for education in compilers. It is object-oriented and has features of a full programming language. For the realization of the lexical and syntactic analyzers generators are used, which hide from the students the work of the basic algorithms for analysis of the input program.
ChocoPy is a programming language designed for undergraduate classes in language processors and compilers and interpreters at the University of California, Berkeley [6, 14]. ChocoPy is a restricted subset of Python 3, which can easily be compiled to a target architecture such as RISC-V. The language is fully specified using formal grammar, typing rules and operational semantics. ChocoPy programs can be executed directly in a Python (3.6+) interpreter. ChocoPy programs can also be edited using standard Python syntax highlighting.
The miniC input language compiler is a subset of C [7]. This compiler was developed at the University of Novi Sad, Serbia. The statements supported in miniC language are: simple assignment statement, simplified if statement, return statement and compound statement. Function calls with only one argument are supported. The input language is too simple and that is the main disadvantage of the compiler.
Best practices for the construction of training compilers, for the specifics of programming languages for education and for the organization of compilers courses in general are presented for example in [1, 2, 5, 6, 7, 9, 10, 11]. Some advices for training compilers concerning code generation are proposed for example in [4] and an interesting material for the structure of a virtual machine is for example [13].
As a result of the review, the following main recommendations for a training compiler are summarized:
* The input language of the training compiler should be simple but at the same time should possess all basic functionalities of the modern programming languages. The input language should also be close to the programming languages studied in previous courses; this will save students' time to dive deeper into the methods and algorithms of compilation.
* The structure of the compiler should consist of modules, each one implementing a separate compilation stage, which could be developed and tested independently from the other modules.
* The front-end stage of the compiler should produce (and visualize) an abstract syntax tree of the program which could help students better understand the structure of the input program and prepare for the back-end stage of the compiler.
* The compiler should generate code for virtual machine to demonstrate to students the principles of portability of software used by modern programming languages.
Regarding these recommendations, the authors developed a training compiler which has been used for education in "Compilers" and "Language Processors". The next parts of the paper describe the structure of the compiler, the benefits to the learning process and present some results of compiler tests.
I. STRUCTURE OF THE TRAINING COMPILER
1.1Description of the training compiler modules
The structure of the training compiler proposed by the authors is presented on fig. 1.
The modules of the training compiler could be developed and tested independently. Of course, as each module uses the output of the previous ones, each module must produce correct results.
* Input-output interface - reads the text of the input program and transforms it into a stream of characters.
* Lexical Analyzer - reads the stream of characters, removes non-significant characters, and transforms it into a token stream.
* Parser - reads the stream of tokens, recognizes a valid sentence in the input language, forms an abstract syntax tree.
* Semantic Analyzer - checks scope of variables, data type compatibility in expressions and forms a symbol table using the abstract syntax tree produced by the parser.
* Code Generator - generates code for a virtual machine based on a semantically correct tree and symbol table.
* Virutal machine interpreter - reads the generated instructions, executes them and produces a result.
1.2Description of the input language
The input language for the presented compiler is called STUDENT. It is procedure-oriented and is a subset of the C and Java languages which are known to students from several previous courses. STUDENT is a high level language which is simple and at the same time provides features to solve a broad range of practical problems. It contains primitive and array data types, basic arithmetic and logical statements, expressions. The language is described by a grammar of type LL(1). It is presented to students into Extended Backus-Naur Form (EBNF) notation.
The STUDENT programming language provides the following features:
* Data types
* integer type
* boolean type
* character type
* one-dimensional array
* Variables
* global
* local
* Expressions
* Arithmetic
* logical
* Statements
* arithmetic operators (+, -, ·, /)
* logical operators (and, or, not)
* comparison operators (<,>, <=,> =,! =, ==)
* assignment operator (=)
* unary operators (-)
* conditional operator (if else)
* while loop operator
* compound operator ({...})
* input-output operators (read, print)
* Recursion
* Integer division remainder (%)
* Character escaping (e.g. '\ n', '\ t')
* Dynamically allocated array (int arr = int [10])
* Length - built-in function for finding the number of elements in an array
* Subroutines - functions that accept parameters and return a result
For example the implementation of the classical Knapsack Problem in STUDENT looks as follows.
...
1.3Experimental results
The experiments are carried out using the presented training compiler and a sample input program for recursive calculation of factorial.
...
The results from the translation of the program produced by each module are describe below.
1.3.1.Lexical analyzer
The lexical analyzer forms token into the input program and shows the tokens in the following format:
Token name : lexeme string
A part of the output for the test program is as follows:
...
1.3.2.Parser
The parser identifies syntax errors into the program or in case of successful analysis produces an abstract syntax tree of the program. The tree is shown in XML format.
...
1.3.3.Code generator
The generator produces instructions for a virtual machine. The machine has the following structure:
* program counter and an array of instructions
* stack
* heap
A part of generated instructions for the sample program is as follows:
0: ICONST 35
2: GSTORE 0
4: GLOAD 0
6: NEWARRAY 0
8: GSTORE 1
10: GOTO 79
12: GLOAD 1
14: ILOAD 0
16: IALOAD
17: ICONST 0
19: ICMPNE
20: IF_FALSE 28
22: GLOAD 1
1.3.4.Results of the program run
An interpreter for the virtual machine is also developed. It reads the stream of instructions and execute them. The final result from the run of the test program is:
1.3.5.fibonacci number is: 9227465
II. CONCLUSIONS
The presented training compiler demonstrates the typical stages of the life cycle of the compilation process. It has the following main advantages:
- The input language is based on programming languages, which are studied in several disciplines and are familiar to students.
- The compiler is platform independent because it is implemented in Java.
- The logic of the compilation process is divided into separate modules, each of which is relatively easy to understand and manage.
- The parser produces explicit abstract syntax tree, which helps students in easier understanding the program structure and in preparing for the synthesis of the object program.
- The generated instructions are similar to the instructions in the abstract model of Java virtual machine, which allows students to get acquainted with the operation of a real virtual machine.
There are many possible directions for the development of the learning compiler:
- If it is found that students are quick to absorb presentation material and succeed in setting tasks, the grammar can be upgraded from LL(1) to LL(k), which will allow new rules to be added to the input language (for example user-defined types).
- Adding multidimensional arrays.
- Adding new operators (bitwise operations).
Further experiments with different groups of students working with the training compiler will be carried out. The potential weak points will be figured out in order to improve the compiler implementation regarding more effective learning in compilers and language processors in the university course.
Reference Text and Citations
[1] Aho, A. ; Teaching the Compilers Course, ACMSIGCSEBulletin, DOI: 10.1145/1473195.1473196, 2008.
[2] Jabri, Riad, 2013. A Generic Tool for Teaching Compilers, Canadian Center of Science and Education ISSN 19138989, DOI:10.5539/cis.v6n2p134, 2013
[3] Aiken, A.; Cool: a portable project for teaching compiler construction. ACM SIGPLAN Notices, Vol. 31, Iss. 7, ACM, New York, 1996, pp. 19-24.
[4] Mahoney, W. ;Pedersen, J.; Teaching compiler code generation -Simpler is better, ACM SIGCSE Bulletin DOI:10.1145/1709424.1709440, 2010, pp. 30-34.
[5] Na,Wang; ShiMing,Zhang; Construction of compiler technology course in application-based university, International Conference on Education Technology and Information System (ICETIS 2013), 2013
[6] Padhye, Rohan; Sen,Koushik; Hilfinger ,Paul; ChocoPy: A Programming Language for Compilers Courses. SPlAsH-E 2019: Proceedings of the 2019 ACM SIGPLAN Symposium on SPLASH-E Athens, Greece,October 2019 pp. 41-45
[7] Rakic, Zorica ; Rakic ,Predrag; Petric,Tara;. miniC Project for Teaching Compilers Course, ICIST 2014 4th International Conference on Information Society and Technology, Belgrade, Serbia, 2014, pp. 360- 362
[8] Ruskov,Trifon; Valchanov,Hristo, 2007. Compilers and interpreters. Laboratory exercises , Technical University of Varna ISBN 978-954-20-0386-1, Varna, 2013
[9] Zatarain-Cabada, R.; Barron-Estrada, M. ; Zepeda-Sanchez, L.; Quintero-Meza, R. ; Reyes Garcia, C. ; A Hybrid Learning Compiler Course, Hybrid Learning. ICHL 2010. Lecture Notes in Computer Science, ISBN: 978-3-64214656-5, Vol 6248. Springer, Berlin, Heidelberg, 2010.
[10] Sekerinski, E; Zíngaro, D.; Pascal0 Compiler. Course Notes for Compiler Construction, 2011.
[11] Zhang, Haiyi; The Considerations of Teaching Computer Science Courses AI and Compilers, Proceedings of the International Conference on Frontiers in Education: Computer Science and Computer Engineering (FECS); Athens, 2018
[12] http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=Compilers
[13] https://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables
[14] https://chocopy.org/
Copyright "Carol I" National Defence University 2021