Knowledge in Compiler Design

Evolution of Programming Languages

The Evolution of Programming Languages

Impacts on Compilers

High-performance compilers (i.e., compilers whose generated code performs well) are crucial for the adoption of new language concepts and computer architectures. Also important is the resource utilization of the compiler itself. Modern compilers are large. On my laptop the compressed source of gcc is 38MB, so uncompressed it must be about 100MB.

The Science of Building a Compiler

Modeling in Compiler Design and Implementation

We will encounter several aspects of computer science during the course. Some, e.g., trees, I'm sure you already know well. Other, more theoretical aspects, such as nondeterministic finite automata, may be new.

The Science of Code Optimization

We will do very little optimization. That topic is typically the subject of a second compiler course. Considerable theory has been developed for optimization, but sadly we will see essentially none of it. We can, however, appreciate the pragmatic requirements.

1. The optimizations must be correct (in all cases).
2. Performance must be improved for most programs.
3. The increase in compilation time must be reasonable.
4. The implementation effort must be reasonable.

Application of Compiler Technology

Applications of Compiler Technology

Implementation of High-Level Programming Languages

Abstraction: All modern languages support abstraction. Data-flow analysis permits optimizations that significantly reduce the execution-time cost of abstractions.

Inheritance: The increasing use of smaller, but more numerous, methods has made interprocedural analysis important. Optimizations have also improved virtual method dispatch.

Array bounds checking in Java and Ada: Optimizations have been produced that eliminate many checks.

Garbage collection in Java: Improved algorithms.

Dynamic compilation in Java: Optimizations to predict/determine the parts of the program that will be heavily executed and thus should be the first/only parts dynamically compiled into native code.

Optimization for Computer Architectures

Parallelism

Major research efforts have led to improvements in

1. Automatic parallelization: Examine serial programs to determine and expose potential parallelism.
2. Compilation of explicitly parallel languages.

Memory Hierarchies

All machines have a limited number of registers, which can be accessed much faster than central memory. All but the simplest compilers devote effort to using this scarce resource effectively. Modern processors have several levels of caches, and advanced compilers produce code designed to utilize the caches well.

Design of New Computer Architectures

RISC (Reduced Instruction Set Computer)

RISC computers have comparatively simple instructions; complicated operations require several RISC instructions. A CISC (Complex Instruction Set Computer) contains both complex and simple instructions. A sequence of CISC instructions translates into a longer sequence of RISC instructions. Advanced optimizations are able to find commonality in this longer sequence and lower the total number of instructions. The CISC Intel x86 processor line (8086/80286/80386/...) had a major change with the 686 (a.k.a. Pentium Pro). In this processor, the CISC instructions were decomposed into RISC instructions by the processor itself. Currently, code for x86 processors normally achieves highest performance when the (optimizing) compiler emits primarily simple instructions.

Specialized Architectures

A great variety has emerged. Compilers are produced before the processors are fabricated. Indeed, compilation plus simulated execution of the generated machine code is used to evaluate proposed designs.

Program Translation in Compiler Design

Program Translations

Binary Translation

This means translating from one machine language to another. Companies changing processors sometimes use binary translation to execute legacy code on the new machines. Apple did this when converting from Motorola CISC processors to the PowerPC. An alternative is to have the new processor execute programs in both the new and old instruction sets. Intel had the Itanium processor also execute x86 code. Apple, however, did not produce its own processors.

With the recent dominance of x86 processors, binary translators from x86 have been developed so that other microprocessors can be used to execute x86 software.

Hardware Synthesis

In the old days integrated circuits were designed by hand. For example, the NYU Ultracomputer research group in the 1980s designed a VLSI chip for rapid interprocessor coordination. The design software we used essentially let you paint. You painted blue lines where you wanted metal, green for polysilicon, etc. Where certain colors crossed, a transistor appeared.

Current microprocessors are much too complicated to permit such a low-level approach. Instead, designers write in a high-level description language, which is compiled down to the specific layout.

Database Query Interpreters

The optimization of database queries and transactions is quite a serious subject.

Compiled Simulation

Instead of simulating a design on many inputs, it may be faster to compile the design first into a lower-level representation and then execute the compiled version.

Syntax Analysis in Compiler Design

Introduction

We will be looking at the front end, i.e., the analysis portion of a compiler.

The syntax describes the form of a program in a given language, while the semantics describes the meaning of that program. We will use the standard context-free grammar or BNF (Backus-Naur Form) to describe the syntax.

We will learn syntax-directed translation, where the grammar does more than specify the syntax. We augment the grammar with attributes and use this to guide the entire front end.

The front end discussed in this chapter has as source language infix expressions consisting of digits, +, and -. The target language is postfix expressions with the same components. The compiler will convert 7+4-5 to 74+5-. Actually, our simple compiler will handle a few other operators as well.

We will tokenize the input (i.e., write a scanner), model the syntax of the source, and let this syntax direct the translation all the way to three-address code, our intermediate language.

Syntax Definition

Definition of Grammars

A context-free grammar (CFG) consists of

1. A set of terminal tokens.
2. A set of nonterminals.
3. A set of productions (rules for transforming nonterminals).
4. A specific nonterminal designated as the start symbol.

Example:

    Terminals: 0 1 2 3 4 5 6 7 8 9 + -
    Nonterminals: list digit
    Productions:
        list → list + digit
        list → list - digit
        list → digit
        digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    Start symbol: list

If no start symbol is specifically designated, the LHS of the first production is the start symbol.

Derivations

Watch how we can generate the input 7+4-5 starting with the start symbol, applying productions, and stopping when no productions are possible (we have only terminals).

    list → list - digit
         → list - 5
         → list + digit - 5
         → list + 4 - 5
         → digit + 4 - 5
         → 7 + 4 - 5

This process of applying productions, starting with the start symbol and ending when only terminals are present, is called a derivation, and we say that the final string has been derived from the initial string (in this case the start symbol).

The set of all strings derivable from the start symbol is the language generated by the CFG.

It is important that you see that this context-free grammar generates precisely the set of infix expressions with single digits as operands (so 25 is not allowed) and + and - as operators.

The way you get different final expressions is that you make different choices of which production to apply. There are 3 productions you can apply to list and 10 you can apply to digit.

The result cannot have blanks since blank is not a terminal.

The empty string is not possible since, starting from list, we cannot get to the empty string. If we wanted to include the empty string, we would add the production

    list → ε

The idea is that the input language to the compiler is approximately the language generated by the grammar. It is approximate since I have ignored the scanner.

Given a grammar, parsing a string consists of determining if the string is in the language generated by the grammar. If it is in the language, parsing produces a derivation. If it is not, parsing reports an error.

The opposite of derivation is reduction: in a derivation the LHS of a production produces the RHS, while in a reduction the RHS is reduced by the production to the LHS.
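
To make membership in the generated language concrete, here is a minimal sketch in Python (my own code, not part of these notes; the name is_in_language is invented for illustration). It exploits the fact that the grammar generates exactly the strings of the form digit ((+|-) digit)*.

    # A sketch of a recognizer for the language generated by:
    #   list -> list + digit | list - digit | digit
    # Equivalently, the language is: digit ((+|-) digit)*

    def is_in_language(s: str) -> bool:
        """Return True iff s is derivable from the start symbol 'list'."""
        if not s or not s[0].isdigit():
            return False                     # must start with a single digit
        i = 1
        while i < len(s):                    # then repeat: operator, digit
            if s[i] not in "+-" or i + 1 >= len(s) or not s[i + 1].isdigit():
                return False
            i += 2
        return True

    assert is_in_language("7+4-5")
    assert not is_in_language("25")          # multi-digit operands not derivable
    assert not is_in_language("7+")          # dangling operator

Note that, like the grammar, the recognizer rejects blanks; a real front end would have the scanner discard them first.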

Parse Trees and Ambiguity in Compiler Design

Parse Trees

While deriving 7+4-5, one could produce the parse tree shown on the right.

You can read off the productions from the tree. For any internal (i.e., non-leaf) tree node, its children give the right-hand side (RHS) of a production having the node itself as the LHS.

The leaves of the tree, read from left to right, are called the yield of the tree. We call the tree a derivation of its yield from its root. The tree on the right is a derivation of 7+4-5 from list.

Ambiguity

An ambiguous grammar is one in which there are two or more parse trees yielding the same final string. We wish to avoid such grammars.

The grammar above is not ambiguous. For example, 1+2+3 can be parsed only one way; the arithmetic must be done left to right. Note that I am not giving a rule of arithmetic, just of this grammar. If you reduced 2+3 to list you would be stuck, since it is impossible to further reduce 1+list (said another way, it is not possible to derive 1+list from the start symbol).
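
Since the tree itself is only shown as a figure, here is a sketch (mine, in Python, with an invented tuple representation) of the parse tree for 7+4-5 together with a function computing its yield.

    # A node is (label, children); a leaf is just the terminal string.
    def tree(label, *children):
        return (label, list(children))

    # list -> list - digit, where the inner list -> list + digit, etc.
    parse_tree = tree("list",
                      tree("list",
                           tree("list", tree("digit", "7")),
                           "+",
                           tree("digit", "4")),
                      "-",
                      tree("digit", "5"))

    def yield_of(node):
        """Concatenate the leaves left to right."""
        if isinstance(node, str):
            return node
        _, children = node
        return "".join(yield_of(c) for c in children)

    assert yield_of(parse_tree) == "7+4-5"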

Operator Associativity and Precedence

Associativity of Operators

Our grammar gives left associativity. That is, if you traverse the parse tree in postorder and perform the indicated arithmetic, you will evaluate the string left to right. Thus 8-8-8 would evaluate to -8. If you wished to have right associativity (normally exponentiation is right associative, so 2**3**2 gives 512, not 64), you would change the first two productions to

    list → digit + list
    list → digit - list

Produce in class the parse tree for 7+4-5 with this new grammar.

Precedence of Operators

We normally want * to have higher precedence than +. We do this by using an additional nonterminal to indicate the items that have been multiplied. The example below gives the four basic arithmetic operations their normal precedence unless overridden by parentheses. Redundant parentheses are permitted. Equal-precedence operations are performed left to right.

    expr → expr + term | expr - term | term
    term → term * factor | term / factor | factor
    factor → digit | ( expr )
    digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

We use | to indicate that a nonterminal has multiple possible right-hand sides. So

    A → B | C

is simply shorthand for

    A → B
    A → C

Do the examples 1+2/3-4*5 and (1+2)/3-4*5 on the board.

Note how the precedence is enforced by the grammar; slick! (A small evaluator realizing this grammar appears at the end of this section.)

Statements

Keywords are very helpful for distinguishing statements from one another.

    stmt → id := expr
         | if expr then stmt
         | if expr then stmt else stmt
         | while expr do stmt
         | begin opt-stmts end
    opt-stmts → stmt-list | ε
    stmt-list → stmt-list ; stmt | stmt

Remarks:

1. opt-stmts stands for optional statements. The begin-end block can be empty in some languages.
2. The ε (epsilon) stands for the empty string.
3. The use of epsilon productions will add complications.
4. Some languages do not permit empty blocks. For example, Ada has a null statement, which does nothing when executed, for this purpose.
5. The above grammar is ambiguous!
6. The notorious "dangling else" problem.
7. How do you parse if x then if y then z=1 else z=2?
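
Returning to associativity and precedence: here is the promised sketch (mine, in Python; not the course's official code) of a recursive-descent evaluator for the expr/term/factor grammar. The left recursion is replaced by loops, which preserves left associativity; * and / bind tighter simply because term sits below expr in the grammar.

    def evaluate(s: str):
        pos = 0

        def peek():
            return s[pos] if pos < len(s) else None

        def expr():
            nonlocal pos
            value = term()
            while peek() in ("+", "-"):       # left to right: left associative
                op = s[pos]; pos += 1
                value = value + term() if op == "+" else value - term()
            return value

        def term():
            nonlocal pos
            value = factor()
            while peek() in ("*", "/"):       # * and / bind tighter than + and -
                op = s[pos]; pos += 1
                value = value * factor() if op == "*" else value / factor()
            return value

        def factor():
            nonlocal pos
            if peek() == "(":
                pos += 1                      # consume (
                value = expr()
                pos += 1                      # consume )
                return value
            value = int(s[pos]); pos += 1     # single-digit operand
            return value

        return expr()

    assert evaluate("8-8-8") == -8            # left associativity
    assert evaluate("1+2/3-4*5") == 1 + 2/3 - 4*5
    assert evaluate("(1+2)/3-4*5") == (1+2)/3 - 4*5

Error handling is omitted; the point is only that the nesting of the procedures mirrors the nesting of the nonterminals, so precedence comes for free.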

Syntax-Directed Translation

Syntax-Directed Translation

Syntax-directed translation means specifying the translation of a source-language construct in terms of attributes of its syntactic components. The basic idea is to use the productions to specify a (typically recursive) procedure for translation. For example, consider the production

    stmt-list → stmt-list ; stmt

To process the left stmt-list, we

1. Call ourselves recursively to process the right stmt-list (which is smaller). This will, say, generate code for all the statements in the right stmt-list.
2. Call the procedure for stmt, generating code for stmt.
3. Process the left stmt-list by combining the results of the first two steps as well as what is needed for the semicolon (a terminal, so we do not further delegate its actions). In this case we probably concatenate the code for the right stmt-list and stmt. (A sketch of this procedure follows below.)

To avoid having to say the right stmt-list, we write the production as

    stmt-list → stmt-list1 ; stmt

where the subscript is used to distinguish the two instances of stmt-list.
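
Here is the sketch (mine, in Python; gen_stmt and gen_stmt_list are invented names, and a statement list is modeled as a Python list of statement strings).

    def gen_stmt(stmt):
        return [f"code for {stmt}"]           # placeholder single-statement
                                              # code generator

    def gen_stmt_list(stmts):
        if len(stmts) == 1:                   # stmt-list -> stmt
            return gen_stmt(stmts[0])
        # stmt-list -> stmt-list1 ; stmt
        code = gen_stmt_list(stmts[:-1])      # step 1: recurse on stmt-list1
        code += gen_stmt(stmts[-1])           # step 2: code for stmt
        return code                           # step 3: concatenation; the
                                              # semicolon needs no code

    print(gen_stmt_list(["x := 1", "y := 2", "z := x + y"]))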

Postfix Notation in Compiler Design

Postfix Notation

This notation is called postfix because the rule is operator after operand(s). Parentheses are not needed. The notation we normally use is called infix. If you start with an infix expression, the following algorithm will give you the equivalent postfix expression.

1. Variables and constants are left alone.
2. E op F becomes E' F' op, where E' and F' are the postfix of E and F respectively.
3. ( E ) becomes E', where E' is the postfix of E.

One question is, given say 1+2-3, what are E, F, and op? Does E=1+2, F=3, and op=-? Or does E=1, F=2-3, and op=+? This is the issue of precedence mentioned above. To simplify the present discussion we will start with fully parenthesized infix expressions.

Example: 1+2/3-4*5

1. Start with 1+2/3-4*5.
2. Parenthesize (using standard precedence) to get (1+(2/3))-(4*5).
3. Apply the above rules to calculate P{(1+(2/3))-(4*5)}, where P{X} means "convert the infix expression X to postfix".

    P{(1+(2/3))-(4*5)}
    P{(1+(2/3))} P{(4*5)} -
    P{1+(2/3)} P{4*5} -
    P{1} P{2/3} + P{4} P{5} * -
    1 P{2} P{3} / + 4 5 * -
    1 2 3 / + 4 5 * -

Example: Now do (1+2)/3-4*5.

1. Parenthesize to get ((1+2)/3)-(4*5).
2. Calculate P{((1+2)/3)-(4*5)}.

    P{((1+2)/3)} P{(4*5)} -
    P{(1+2)/3} P{4*5} -
    P{(1+2)} P{3} / P{4} P{5} * -
    P{1+2} 3 / 4 5 * -
    P{1} P{2} + 3 / 4 5 * -
    1 2 + 3 / 4 5 * -
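
The three rules translate almost directly into code. Here is a sketch (mine, in Python) under the section's simplifying assumption: operands are single digits and every E op F, except possibly the outermost one, is parenthesized.

    def to_postfix(s: str) -> str:
        pos = 0

        def P():
            # parse one parenthesized subexpression; return its postfix
            nonlocal pos
            if s[pos].isdigit():              # rule 1: operand left alone
                d = s[pos]; pos += 1
                return d
            assert s[pos] == "("              # otherwise ( E ) or ( E op F )
            pos += 1
            left = P()
            if s[pos] == ")":                 # rule 3: ( E ) becomes E'
                pos += 1
                return left
            op = s[pos]; pos += 1             # rule 2: E op F becomes E' F' op
            right = P()
            assert s[pos] == ")"
            pos += 1
            return left + right + op

        left = P()
        if pos < len(s):                      # allow an unparenthesized
            op = s[pos]; pos += 1             # outermost E op F
            return left + P() + op
        return left

    assert to_postfix("(1+(2/3))-(4*5)") == "123/+45*-"
    assert to_postfix("((1+2)/3)-(4*5)") == "12+3/45*-"

The two assertions reproduce the two worked examples above.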

Synthesized Attributes in Compiler Design

Synthesized Attributes

We want to decorate the parse trees we construct with annotations that give the value of certain attributes of the corresponding node of the tree. We will do the example of translating infix to postfix with 1+2/3-4*5. We use the following grammar, which follows the normal arithmetic terminology where one multiplies and divides factors to obtain terms, which in turn are added and subtracted to form expressions.

    expr → expr + term | expr - term | term
    term → term * factor | term / factor | factor
    factor → digit | ( expr )
    digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

This grammar supports parentheses, although our example does not use them. On the right is a movie in which the parse tree is built from this example.

The attribute we will associate with the nodes is the postfix form of the string in the leaves below the node. In particular, the value of this attribute at the root is the postfix form of the entire source.

The book does a simpler grammar (no *, /, or parentheses) for a simpler example. You might find that one easier.

Syntax-Directed Definitions (SDDs)

Definition: A syntax-directed definition is a grammar together with semantic rules associated with the productions. These rules are used to compute attribute values. A parse tree augmented with the attribute values at each node is called an annotated parse tree.

For the bottom-up approach I will illustrate now, we annotate a node after having annotated its children. Thus the attribute values at a node can depend on the children of the node but not on the parent of the node. We call these synthesized attributes, since they are formed by synthesizing the attributes of the children.

In chapter 5, when we study top-down annotations as well, we will introduce inherited attributes that are passed down from parents to children.

We specify how to synthesize attributes by giving the semantic rules together with the grammar. That is, we give the syntax-directed definition.

    Production               Semantic Rule
    expr → expr1 + term      expr.t := expr1.t || term.t || '+'
    expr → expr1 - term      expr.t := expr1.t || term.t || '-'
    expr → term              expr.t := term.t
    term → term1 * factor    term.t := term1.t || factor.t || '*'
    term → term1 / factor    term.t := term1.t || factor.t || '/'
    term → factor            term.t := factor.t
    factor → digit           factor.t := digit.t
    factor → ( expr )        factor.t := expr.t
    digit → 0                digit.t := '0'
    digit → 1                digit.t := '1'
    digit → 2                digit.t := '2'
    digit → 3                digit.t := '3'
    digit → 4                digit.t := '4'
    digit → 5                digit.t := '5'
    digit → 6                digit.t := '6'
    digit → 7                digit.t := '7'
    digit → 8                digit.t := '8'
    digit → 9                digit.t := '9'

We apply these rules bottom-up (starting with the geographically lowest productions, i.e., the lowest lines on the page) and get the annotated parse tree shown on the right. The annotations are drawn in green.
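
To make the bottom-up application of the rules concrete, here is a sketch (mine, in Python; the Node class and synthesize function are invented for illustration) that computes the .t attribute of each node in postorder, i.e., after computing it for all children.

    class Node:
        def __init__(self, label, children):
            self.label = label                # nonterminal name or terminal
            self.children = children          # empty list for leaves
            self.t = None                     # the synthesized attribute

    def synthesize(node):
        if not node.children:                 # a terminal: its own text
            node.t = node.label
            return node.t
        parts = [synthesize(c) for c in node.children]
        if len(parts) == 3 and node.children[1].label in "+-*/":
            node.t = parts[0] + parts[2] + parts[1]  # e.g. expr1.t || term.t || '+'
        elif len(parts) == 3:                 # factor -> ( expr )
            node.t = parts[1]                 # factor.t := expr.t
        else:                                 # unit productions copy the child
            node.t = parts[0]
        return node.t

    def leaf(s):
        return Node(s, [])

    # Annotate the parse tree for 1+2 (expr -> expr + term, etc.).
    root = Node("expr", [
        Node("expr", [Node("term", [Node("factor", [Node("digit", [leaf("1")])])])]),
        leaf("+"),
        Node("term", [Node("factor", [Node("digit", [leaf("2")])])]),
    ])
    assert synthesize(root) == "12+"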

Simple Syntax-Directed Definitions

Simple Syntax-Directed Definitions

If the semantic rules of a syntax-directed definition all have the property that the new annotation for the left-hand side (LHS) of the production is just the concatenation of the annotations for the nonterminals on the RHS, in the same order as the nonterminals appear in the production, we call the syntax-directed definition simple. It is still called simple if new strings are interleaved with the original annotations. So the example just done is a simple syntax-directed definition.

Remark: SDDs feature semantic rules. We will soon learn about translation schemes, which feature a related concept called semantic actions. When one has a simple SDD, the corresponding translation scheme can be executed without constructing the parse tree. That is, while doing the parse, when you get to the point where you would construct the node, you just do the actions. In the corresponding translation scheme for the present example, the action at a node is just to print the new strings at the appropriate points. (A sketch follows below.)
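
Here is that sketch (mine, in Python; not the notes' code): a recursive-descent parse of the expr/term/factor grammar where, instead of building and annotating a tree, the semantic actions (the print calls) fire at exactly the points where the SDD would have concatenated strings.

    def translate(s: str):
        pos = 0

        def expr():
            nonlocal pos
            term()
            while pos < len(s) and s[pos] in "+-":
                op = s[pos]; pos += 1
                term()
                print(op, end="")             # action: emit the operator last

        def term():
            nonlocal pos
            factor()
            while pos < len(s) and s[pos] in "*/":
                op = s[pos]; pos += 1
                factor()
                print(op, end="")

        def factor():
            nonlocal pos
            if s[pos] == "(":
                pos += 1; expr(); pos += 1    # parentheses emit nothing
            else:
                print(s[pos], end=""); pos += 1  # action: emit the digit

        expr()
        print()

    translate("1+2/3-4*5")                    # prints 123/+45*-

Note that no tree is ever constructed: the output is produced as a side effect of the parse itself, which is exactly what makes simple SDDs attractive.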

Tree Traversal in Compiler Design

Tree Traversals

When traversing a tree, there are several choices as to when to visit a given node. The traversal can visit the node

1. Before visiting any children.
2. Between visiting children.
3. After visiting all the children.

I do not like the book's code, as I feel the names chosen confuse the traversal with visiting the nodes. I prefer the following pseudocode, which also illustrates traversals that are not depth-first. Comments are introduced by -- and terminate at the end of the line.

    procedure traverse (n : node)
        -- visit(n);                   before children
        if (n is not a leaf) then
            c = first child;
            traverse (c);
            while more children
                -- visit(n);           between children
                c = next child;
                traverse (c);
            end while;
        end if;
        -- visit(n);                   after children
    end traverse;

1. If you uncomment just the first visit, we have a preorder traversal, where each node is visited before its children.
2. If you uncomment just the last visit, we have a postorder traversal or depth-first traversal, where each node is visited after all its children.
3. If you have a binary tree (all non-leaf nodes have exactly 2 children) and you uncomment only the middle visit, we have an inorder traversal.
4. If you uncomment all three visits, we have an Euler-tour traversal.
5. If you uncomment two of the three visits, we have an unnamed traversal.
6. If you uncomment none of the visits, we have a program that accomplishes very little.
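
For those who want to run the pseudocode, here is a Python rendering (mine, not the notes' code) in which the three optional visits become flags, so one function covers all the cases in the list above.

    def traverse(n, visit, before=False, between=False, after=False):
        # n is (label, children); a leaf has an empty children list
        label, children = n
        if before:
            visit(label)                      # preorder position
        for i, c in enumerate(children):
            if between and i > 0:
                visit(label)                  # inorder position (binary trees)
            traverse(c, visit, before, between, after)
        if after:
            visit(label)                      # postorder position

    # Postorder over the expression tree for 7+4-5 yields its postfix form.
    t = ("-", [("+", [("7", []), ("4", [])]), ("5", [])])
    out = []
    traverse(t, out.append, after=True)
    assert "".join(out) == "74+5-"

The final assertion ties this chapter together: a postorder traversal of the (abstract) tree for 7+4-5 produces exactly the postfix translation 74+5-.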