Knowledge in Compiler Construction

Symbol Table and Grouping of Phases

Symbol-Table Management

The symbol table stores information about program variables that will be used across phases. Typically, this includes type information and storage location.

A possible point of confusion: the storage location does not give the location where the compiler has stored the variable. Instead, it gives the location where the compiled program will store the variable.

The Grouping of Phases into Passes

Logically, each phase is viewed as a separate program that reads input and produces output for the next phase, i.e., a pipeline. In practice some phases are combined into a pass. For example, one could have the entire front end as one pass.

The term pass is used to indicate that the entire input is read during this activity. So two passes means that the input is read twice. We have discussed two-pass approaches for both assemblers and linkers. If we implement each phase separately and use multiple passes for some of them, the compiler will perform a large number of I/O operations, an expensive undertaking.

As a result, techniques have been developed to reduce the number of passes. We will see in the next chapter how to combine the scanner, parser, and semantic analyzer into one phase. Consider the parser. When it needs another token, rather than reading the input file (presumably produced by the scanner), the parser calls the scanner instead. At selected points during the production of the syntax tree, the parser calls the intermediate-code generator, which performs semantic analysis as well as generating a portion of the intermediate code.

For pedagogical reasons, we will not be employing this technique. Thus your compiler will consist of separate programs for the scanner, parser, and semantic analyzer / intermediate code generator.

Reducing the Number of Passes

One problem with combining phases, or with implementing a single phase in one pass, is that it appears that an internal form of the entire program will need to be stored in memory. This problem arises because the downstream phase may need, early in its execution, information that the upstream phase produces only late in its execution. This motivates the use of symbol tables and a two-pass approach. However, a clever one-pass approach is often possible.

Consider the assembler (or linker). The good case is when the definition precedes all uses, so that the symbol table contains the value of the symbol prior to that value being needed. Now consider the harder case of one or more uses preceding the definition. When a not-yet-defined symbol is first used, an entry is placed in the symbol table, pointing to this use and indicating that the definition has not yet appeared. Further uses of the same symbol attach their addresses to a linked list of “undefined uses” of this symbol. When the definition is finally seen, the value is placed in the symbol table, and the linked list is traversed, inserting the value in all previously encountered uses. Subsequent uses of the symbol will find its definition in the table. This technique is called backpatching.
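To make backpatching concrete, here is a minimal sketch in Python, under simplifying assumptions: the input is a list of label definitions and jump instructions, a symbol's value is just its position in the emitted code, and the list of undefined uses is an ordinary Python list rather than being threaded through the code words themselves, as a real assembler would do. All names here are illustrative.

```python
# Backpatching sketch: one pass over the input, resolving forward
# references when the definition finally appears. Illustrative only.

def assemble(lines):
    code = []     # emitted words; an unresolved use holds None for now
    symtab = {}   # symbol -> ('defined', value) or ('undef', [use positions])

    for line in lines:
        if line.endswith(':'):                  # a definition, e.g. "loop:"
            sym, value = line[:-1], len(code)
            kind, info = symtab.get(sym, ('undef', []))
            if kind == 'undef':
                for pos in info:                # traverse the undefined-use list
                    code[pos] = value           # ... patching each earlier use
            symtab[sym] = ('defined', value)
        else:                                   # a use, e.g. "jmp loop"
            _op, sym = line.split()
            kind, info = symtab.get(sym, ('undef', []))
            if kind == 'defined':
                code.append(info)               # good case: definition seen first
            else:
                info.append(len(code))          # remember this use for patching
                symtab[sym] = ('undef', info)
                code.append(None)               # placeholder, backpatched later

    return code

print(assemble(['jmp loop', 'jmp loop', 'loop:', 'jmp loop']))  # [2, 2, 2]
```

Note how the good case and the hard case meet in the use branch: a use either finds a definition already in the table or adds itself to the undefined-use list.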

Compiler Construction Tools

Originally, compilers were written “from scratch”, but now the situation is quite different. A number of tools are available to ease the burden.

We will study tools that generate scanners and parsers. This will involve us in some theory: regular expressions for scanners and various grammars for parsers. These techniques are fairly successful. One drawback can be that they do not execute as fast as “hand-crafted” scanners and parsers.

We will also see tools for syntax-directed translation and automatic code generation. The automation in these cases is not as complete.

Finally, there is the large area of optimization. This is not automated; however, a basic component of optimization is “data-flow analysis” (how values are transmitted between parts of a program), and there are tools to help with this task.

Evolution of Programming Languages

Impacts on Compilers

High-performance compilers (i.e., compilers whose generated code performs well) are crucial for the adoption of new language concepts and computer architectures. Also important is the resource utilization of the compiler itself.

Modern compilers are large. On my laptop the compressed source of gcc is 38MB, so uncompressed it must be about 100MB.

The Science of Building a Compiler

Modeling in Compiler Design and Implementation

We will encounter several aspects of computer science during the course. Some, e.g., trees, I'm sure you already know well. Other, more theoretical aspects, such as nondeterministic finite automata, may be new.

The Science of Code Optimization

We will do very little optimization. That topic is typically the subject of a second compiler course. Considerable theory has been developed for optimization, but sadly we will see essentially none of it. We can, however, appreciate the pragmatic requirements.

- The optimizations must be correct (in all cases).
- Performance must be improved for most programs.
- The increase in compilation time must be reasonable.
- The implementation effort must be reasonable.

Applications of Compiler Technology

Implementation of High-Level Programming Languages

- Abstraction: All modern languages support abstraction. Data-flow analysis permits optimizations that significantly reduce the execution-time cost of abstractions.
- Inheritance: The increasing use of smaller, but more numerous, methods has made interprocedural analysis important. Also, optimizations have improved virtual method dispatch.
- Array bounds checking in Java and Ada: Optimizations have been produced that eliminate many checks.
- Garbage collection in Java: Improved algorithms.
- Dynamic compilation in Java: Optimizations to predict/determine parts of the program that will be heavily executed and thus should be the first/only parts dynamically compiled into native code.

Optimization for Computer Architectures

Parallelism

Major research efforts have led to improvements in

- Automatic parallelization: Examine serial programs to determine and expose potential parallelism.
- Compilation of explicitly parallel languages.

Memory Hierarchies

All machines have a limited number of registers, which can be accessed much faster than central memory. All but the simplest compilers devote effort to using this scarce resource effectively. Modern processors have several levels of caches, and advanced compilers produce code designed to utilize the caches well.

Design of New Computer Architectures

RISC (Reduced Instruction Set Computer)

RISC computers have comparatively simple instructions; complicated operations require several RISC instructions. A CISC, Complex Instruction Set Computer, contains both complex and simple instructions. A sequence of CISC instructions corresponds to a longer sequence of RISC instructions. Advanced optimizations are able to find commonality in this longer sequence and lower the total number of instructions. The CISC Intel x86 processor line 8086/80286/80386/... had a major change with the 686 (a.k.a. Pentium Pro). In this processor, the CISC instructions were decomposed into RISC instructions by the processor itself. Currently, code for x86 processors normally achieves highest performance when the (optimizing) compiler emits primarily simple instructions.

Specialized Architectures

A great variety has emerged. Compilers are produced before the processors are fabricated. Indeed, compilation plus simulated execution of the generated machine code is used to evaluate proposed designs.

Program Translation in Compiler Design

Binary Translation

This means translating from one machine language to another. Companies changing processors sometimes use binary translation to execute legacy code on new machines. Apple did this when converting from Motorola CISC processors to the PowerPC. An alternative is to have the new processor execute programs in both the new and old instruction sets. Intel had the Itanium processor also execute x86 code. Apple, however, did not produce their own processors.

With the recent dominance of x86 processors, binary translators from x86 have been developed so that other microprocessors can be used to execute x86 software.

Hardware Synthesis

In the old days integrated circuits were designed by hand. For example, the NYU Ultracomputer research group in the 1980s designed a VLSI chip for rapid interprocessor coordination. The design software we used essentially let you paint. You painted blue lines where you wanted metal, green for polysilicon, etc. Where certain colors crossed, a transistor appeared.

Current microprocessors are much too complicated to permit such a low-level approach. Instead, designers write in a high-level description language which is compiled down to the specific layout.

Database Query Interpreters

The optimization of database queries and transactions is quite a serious subject.

Compiled Simulation

Instead of simulating a design on many inputs, it may be faster to compile the design first into a lower-level representation and then execute the compiled version.

Software Productivity Tools

Dataflow techniques developed for optimizing code are also useful for finding errors. Here correctness is not an absolute requirement, a good thing since finding all errors is undecidable.

Type Checking

Techniques developed to check for type correctness (we will see some of these) can be extended to find other errors, such as using an uninitialized variable.

Bounds Checking

As mentioned above, optimizations have been developed to eliminate unnecessary bounds checking for languages like Ada and Java that perform the checks automatically. Similar techniques can help find potential buffer overflow errors, which can be a serious security threat.

Memory-Management Tools

Languages (e.g., Java) with garbage collection cannot have memory leaks (failure to free no-longer-accessible memory). Compilation techniques can help to find these leaks in languages like C that do not have garbage collection.

Syntax Analysis in Compiler Design

Introduction

We will be looking at the front end, i.e., the analysis portion of a compiler.

The syntax describes the form of a program in a given language, while the semantics describes the meaning of that program. We will use the standard context-free grammar or BNF (Backus-Naur Form) to describe the syntax.

We will learn syntax-directed translation, where the grammar does more than specify the syntax. We augment the grammar with attributes and use this to guide the entire front end.

The front end discussed in this chapter has as source language infix expressions consisting of digits, +, and -. The target language is postfix expressions with the same components. The compiler will convert 7+4-5 to 74+5-. Actually, our simple compiler will handle a few other operators as well.

We will tokenize the input (i.e., write a scanner), model the syntax of the source, and let this syntax direct the translation all the way to three-address code, our intermediate language.

Syntax Definition

Definition of Grammars

A context-free grammar (CFG) consists of

- A set of terminal tokens.
- A set of nonterminals.
- A set of productions (rules for transforming nonterminals).
- A specific nonterminal designated as the start symbol.

Example:

    Terminals:    0 1 2 3 4 5 6 7 8 9 + -
    Nonterminals: list digit
    Productions:  list → list + digit
                  list → list - digit
                  list → digit
                  digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    Start symbol: list

If no start symbol is specifically designated, the LHS of the first production is the start symbol.

Derivations

Watch how we can generate the input 7+4-5 starting with the start symbol, applying productions, and stopping when no productions are possible (we have only terminals).

    list → list - digit
         → list - 5
         → list + digit - 5
         → list + 4 - 5
         → digit + 4 - 5
         → 7 + 4 - 5

This process of applying productions, starting with the start symbol and ending when only terminals are present, is called a derivation, and we say that the final string has been derived from the initial string (in this case the start symbol).

The set of all strings derivable from the start symbol is the language generated by the CFG.

It is important that you see that this context-free grammar generates precisely the set of infix expressions with single digits as operands (so 25 is not allowed) and + and - as operators.

- The way you get different final expressions is that you make different choices of which production to apply. There are 3 productions you can apply to list and 10 you can apply to digit.
- The result cannot have blanks, since blank is not a terminal.
- The empty string is not possible since, starting from list, we cannot get to the empty string. If we wanted to include the empty string, we would add the production list → ε.

The idea is that the input language to the compiler is approximately the language generated by the grammar. It is approximate since I have ignored the scanner.

Given a grammar, parsing a string consists of determining if the string is in the language generated by the grammar. If it is in the language, parsing produces a derivation. If it is not, parsing reports an error.

The opposite of derivation is reduction: the LHS of a production produces the RHS (a derivation), and the RHS is reduced by the production to the LHS.
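To make the parsing question concrete, here is a minimal sketch in Python (the function name derive is made up for illustration). It decides whether a string is in the language generated by the grammar above and, if so, returns a derivation like the one shown for 7+4-5; otherwise it returns None, corresponding to a parse error.

```python
# Membership test and derivation for the grammar
#   list -> list + digit | list - digit | digit,  digit -> 0 | ... | 9.

def derive(s):
    # Membership: the language is exactly digit ((+|-) digit)*, no blanks.
    if not s or not s[0].isdigit():
        return None
    i = 1
    while i < len(s):
        if s[i] not in '+-' or i + 1 >= len(s) or not s[i + 1].isdigit():
            return None                       # not in the language
        i += 2

    # Build the derivation, expanding the rightmost operator first,
    # exactly as in the 7+4-5 example above.
    form = 'list'
    steps = [form]
    for j in range(len(s) - 1, 0, -2):        # s[j] is a digit, s[j-1] its operator
        form = form.replace('list', 'list ' + s[j - 1] + ' digit', 1)
        steps.append(form)
        form = form.replace('digit', s[j], 1)
        steps.append(form)
    form = form.replace('list', 'digit', 1)
    steps.append(form)
    steps.append(form.replace('digit', s[0], 1))
    return steps

print('\n→ '.join(derive('7+4-5')))   # reproduces the derivation shown above
print(derive('25'))                   # None: 25 is not in the language
```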

Parse Trees and Ambiguity in Compiler Design

Parse Trees

While deriving 7+4-5, one could produce the parse tree shown on the right.

You can read off the productions from the tree. For any internal (i.e., non-leaf) tree node, its children give the right-hand side (RHS) of a production having the node itself as the LHS.

The leaves of the tree, read from left to right, are called the yield of the tree. We call the tree a derivation of its yield from its root. The tree on the right is a derivation of 7+4-5 from list. (A small coded version of this tree appears at the end of this section.)

Ambiguity

An ambiguous grammar is one in which there are two or more parse trees yielding the same final string. We wish to avoid such grammars.

The grammar above is not ambiguous. For example, 1+2+3 can be parsed only one way; the arithmetic must be done left to right. Note that I am not giving a rule of arithmetic, just of this grammar. If you reduced 2+3 to list you would be stuck, since it is impossible to further reduce 1+list (said another way, it is not possible to derive 1+list from the start symbol).
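Here is the parse tree for 7+4-5 as a small Python sketch, with nodes hand-built as (label, children) tuples; the representation is just for illustration. The children of each internal node spell out the RHS of its production, and walking the leaves left to right recovers the yield.

```python
# Parse tree for 7+4-5 under the grammar
#   list -> list + digit | list - digit | digit.
# Internal nodes are (label, children); leaves have no children.

def yield_of(node):
    label, children = node
    if not children:                 # a leaf: a terminal symbol
        return label
    return ''.join(yield_of(child) for child in children)

tree = ('list', [                    # list -> list - digit
    ('list', [                       # list -> list + digit
        ('list', [('digit', [('7', [])])]),
        ('+', []),
        ('digit', [('4', [])]),
    ]),
    ('-', []),
    ('digit', [('5', [])]),
])

print(yield_of(tree))                # 7+4-5, the yield of the tree
```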

Operator Associativity and Precedence

Associativity of Operators

Our grammar gives left associativity. That is, if you traverse the parse tree in postorder and perform the indicated arithmetic, you will evaluate the string left to right. Thus 8-8-8 would evaluate to -8. If you wished to have right associativity (normally exponentiation is right associative, so 2**3**2 gives 512, not 64), you would change the first two productions to

    list → digit + list
    list → digit - list

Produce in class the parse tree for 7+4-5 with this new grammar.

Precedence of Operators

We normally want * to have higher precedence than +. We do this by using an additional nonterminal to indicate the items that have been multiplied. The example below gives the four basic arithmetic operations their normal precedence unless overridden by parentheses. Redundant parentheses are permitted. Equal-precedence operations are performed left to right.

    expr → expr + term | expr - term | term
    term → term * factor | term / factor | factor
    factor → digit | ( expr )
    digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

We use | to indicate that a nonterminal has multiple possible right-hand sides. So

    A → B | C

is simply shorthand for

    A → B
    A → C

Do the examples 1+2/3-4*5 and (1+2)/3-4*5 on the board. Note how the precedence is enforced by the grammar; slick! (An evaluator that follows this grammar is sketched at the end of this section.)

Statements

Keywords are very helpful for distinguishing statements from one another.

    stmt → id := expr
         | if expr then stmt
         | if expr then stmt else stmt
         | while expr do stmt
         | begin opt-stmts end
    opt-stmts → stmt-list | ε
    stmt-list → stmt-list ; stmt | stmt

Remarks:

- opt-stmts stands for optional statements. The begin-end block can be empty in some languages.
- The ε (epsilon) stands for the empty string. The use of epsilon productions will add complications.
- Some languages do not permit empty blocks. For example, Ada has a null statement, which does nothing when executed, for this purpose.
- The above grammar is ambiguous! The notorious “dangling else” problem. How do you parse if x then if y then z=1 else z=2?
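As promised above, here is a minimal sketch of an evaluator that follows the expr/term/factor grammar (Python; all names are illustrative). The left-recursive productions become loops, which preserves left-to-right evaluation, and the precedence comes entirely from the call structure: expr defers to term, which defers to factor.

```python
# Recursive-descent evaluator for
#   expr -> expr + term | expr - term | term
#   term -> term * factor | term / factor | factor
#   factor -> digit | ( expr )
# Assumes a well-formed, blank-free string with single-digit operands.

def evaluate(s):
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else None

    def expr():                    # handles + and - (lowest precedence)
        nonlocal pos
        value = term()
        while peek() in ('+', '-'):
            op = s[pos]; pos += 1
            rhs = term()
            value = value + rhs if op == '+' else value - rhs
        return value

    def term():                    # handles * and / (higher precedence)
        nonlocal pos
        value = factor()
        while peek() in ('*', '/'):
            op = s[pos]; pos += 1
            rhs = factor()
            value = value * rhs if op == '*' else value / rhs
        return value

    def factor():                  # factor -> digit | ( expr )
        nonlocal pos
        if peek() == '(':
            pos += 1               # consume '('
            value = expr()
            pos += 1               # consume ')'
            return value
        value = int(s[pos]); pos += 1
        return value

    return expr()

print(evaluate('1+2/3-4*5'))       # -18.33..., since / and * bind tighter
print(evaluate('(1+2)/3-4*5'))     # -19.0, parentheses override precedence
```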

Syntax-Directed Translation

Syntax-directed translation specifies the translation of a source-language construct in terms of attributes of its syntactic components. The basic idea is to use the productions to specify a (typically recursive) procedure for translation. For example, consider the production

    stmt-list → stmt-list ; stmt

To process the left stmt-list, we

1. Call ourselves recursively to process the right stmt-list (which is smaller). This will, say, generate code for all the statements in the right stmt-list.
2. Call the procedure for stmt, generating code for stmt.
3. Process the left stmt-list by combining the results of the first two steps as well as what is needed for the semicolon (a terminal, so we do not further delegate its actions). In this case we probably concatenate the code for the right stmt-list and stmt.

To avoid having to say “the right stmt-list”, we write the production as

    stmt-list → stmt-list1 ; stmt

where the subscript is used to distinguish the two instances of stmt-list.
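Here is a minimal sketch of this recursive procedure in Python; since we have not yet discussed generating code for an individual stmt, that part is stubbed out with a placeholder string, and all names are illustrative.

```python
# Syntax-directed translation of stmt-list -> stmt-list1 ; stmt | stmt,
# with "code for a statement" stubbed out as a placeholder string.

def translate_stmt(stmt):
    return '<code for ' + stmt + '>\n'       # stand-in for real code generation

def translate_stmt_list(stmts):
    if len(stmts) == 1:                      # stmt-list -> stmt
        return translate_stmt(stmts[0])
    # stmt-list -> stmt-list1 ; stmt
    left = translate_stmt_list(stmts[:-1])   # step 1: recurse on stmt-list1
    right = translate_stmt(stmts[-1])        # step 2: handle the final stmt
    return left + right                      # step 3: concatenate the results

print(translate_stmt_list(['x := 1', 'y := 2', 'z := x + y']), end='')
```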

Postfix Notation in Compiler Design

This notation is called postfix because the rule is operator after operand(s). Parentheses are not needed. The notation we normally use is called infix. If you start with an infix expression, the following algorithm will give you the equivalent postfix expression.

1. Variables and constants are left alone.
2. E op F becomes E' F' op, where E' and F' are the postfix of E and F respectively.
3. ( E ) becomes E', where E' is the postfix of E.

One question is, given say 1+2-3, what are E, F, and op? Does E=1+2, F=3, and op=-? Or does E=1, F=2-3, and op=+? This is the issue of precedence mentioned above. To simplify the present discussion we will start with fully parenthesized infix expressions.

Example: 1+2/3-4*5

1. Start with 1+2/3-4*5.
2. Parenthesize (using standard precedence) to get (1+(2/3))-(4*5).
3. Apply the above rules to calculate P{(1+(2/3))-(4*5)}, where P{X} means “convert the infix expression X to postfix”.

    P{(1+(2/3))-(4*5)}
    = P{(1+(2/3))} P{(4*5)} -
    = P{1+(2/3)} P{4*5} -
    = P{1} P{2/3} + P{4} P{5} * -
    = 1 P{2} P{3} / + 4 5 * -
    = 1 2 3 / + 4 5 * -

Example: Now do (1+2)/3-4*5

1. Parenthesize to get ((1+2)/3)-(4*5).
2. Calculate P{((1+2)/3)-(4*5)}.

    P{((1+2)/3)-(4*5)}
    = P{((1+2)/3)} P{(4*5)} -
    = P{(1+2)/3} P{4*5} -
    = P{(1+2)} P{3} / P{4} P{5} * -
    = P{1+2} 3 / 4 5 * -
    = P{1} P{2} + 3 / 4 5 * -
    = 1 2 + 3 / 4 5 * -
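The three rules translate almost literally into code. Here is a minimal sketch in Python, assuming, as stated above, a fully parenthesized infix expression with single-character operands, so there is at most one operator outside all parentheses; the function name is illustrative.

```python
# Infix to postfix for fully parenthesized expressions, applying the
# three rules above literally.

def postfix(e):
    if len(e) == 1:                  # rule 1: a variable or constant is left alone
        return e
    depth = 0
    for i, c in enumerate(e):        # rule 2: find the operator at depth 0 ...
        if c == '(':
            depth += 1
        elif c == ')':
            depth -= 1
        elif depth == 0 and c in '+-*/':
            # ... then E op F becomes E' F' op
            return postfix(e[:i]) + postfix(e[i + 1:]) + c
    return postfix(e[1:-1])          # rule 3: ( E ) becomes E'

print(postfix('(1+(2/3))-(4*5)'))    # 123/+45*-
print(postfix('((1+2)/3)-(4*5)'))    # 12+3/45*-
```

The two outputs match the worked examples above, written without spaces.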

Synthesized Attributes in Compiler Design

We want to decorate the parse trees we construct with annotations that give the value of certain attributes of the corresponding node of the tree. We will do the example of translating infix to postfix with 1+2/3-4*5. We use the following grammar, which follows the normal arithmetic terminology, where one multiplies and divides factors to obtain terms, which in turn are added and subtracted to form expressions.

    expr → expr + term | expr - term | term
    term → term * factor | term / factor | factor
    factor → digit | ( expr )
    digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

This grammar supports parentheses, although our example does not use them. On the right is a movie in which the parse tree is built from this example.

The attribute we will associate with the nodes is the postfix form of the string in the leaves below the node. In particular, the value of this attribute at the root is the postfix form of the entire source.

The book does a simpler grammar (no *, /, or parentheses) for a simpler example. You might find that one easier.
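As a concrete version of this attribute, here is a minimal sketch in Python, reusing the (label, children) tuple trees from the parse-tree section; all names are illustrative. The postfix attribute of a node is computed only from the attributes of its children, which is what makes the attribute synthesized.

```python
# Synthesized attribute: a node's postfix is computed bottom-up from the
# children's postfix attributes, following the grammar above.

def postfix_attr(node):
    label, children = node
    if not children:                       # a terminal: its postfix is itself
        return label
    attrs = [postfix_attr(child) for child in children]
    if len(children) == 3 and children[1][0] in '+-*/':
        # expr -> expr op term (or term -> term op factor): E' F' op
        return attrs[0] + attrs[2] + attrs[1]
    if len(children) == 3:                 # factor -> ( expr )
        return attrs[1]                    # the parentheses contribute nothing
    return attrs[0]                        # unit production: pass the value up

# Parse tree for 2/3: term -> term / factor, term -> factor, factor -> digit.
tree = ('term', [
    ('term', [('factor', [('digit', [('2', [])])])]),
    ('/', []),
    ('factor', [('digit', [('3', [])])]),
])
print(postfix_attr(tree))                  # 23/
```

Applied at the root of the full parse tree for 1+2/3-4*5, the same bottom-up computation yields 123/+45*-, the postfix form of the entire source.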