Knowledge in Compiler Construction

Compiler Construction Rajasthan Technical University Old Paper (2016)

This is a Rajasthan Technical University old paper. Compiler Construction is a 7th-semester subject at RTU.

Compiler Design Basic

Language Processors

A compiler is a translator from one language, the input or source language, to another language, the output or target language. Often, but not always, the target language is an assembly language or the machine language for a computer processor.

Note that using a compiler requires a two-step process to run a program.

1. Execute the compiler (and possibly an assembler) to translate the source program into a machine language program.
2. Execute the resulting machine language program, supplying appropriate input.

This should be compared with an interpreter, which accepts the source language program and the appropriate input, and itself produces the program output.

Sometimes both compilation and interpretation are used. For example, consider typical Java implementations. The (Java) source code is translated (i.e., compiled) into bytecodes, the machine language for an idealized virtual machine, the Java Virtual Machine or JVM. Then an interpreter of the JVM (itself normally called a JVM) accepts the bytecodes and the appropriate input, and produces the output. This technique was quite popular in academia, with the Pascal programming language and P-code.

Compilation tool chain - part 1

The compilation tool chain

For large programs, the compiler is actually part of a multistep tool chain:

[preprocessor] → [compiler] → [assembler] → [linker] → [loader]

We will be primarily focused on the second element of the chain, the compiler. Our target language will be assembly language. I give a very short description of the other components, including some historical comments.

Preprocessors

Preprocessors are normally fairly simple, as in the C language, providing primarily the ability to include files and expand macros. There are exceptions, however. IBM's PL/I, another Algol-like language, had quite an extensive preprocessor, which made available at preprocessor time much of the PL/I language itself (e.g., loops and, I believe, procedure calls).

Some preprocessors essentially augment the base language to add additional capabilities. One could consider them compilers in their own right, having as source this augmented language (say, Fortran augmented with statements for multiprocessor execution in the guise of Fortran comments) and as target the original base language (in this case Fortran). Often the "preprocessor" inserts procedure calls to implement the extensions at runtime.

Assemblers

Assembly code is a mnemonic version of machine code in which names, rather than binary values, are used for machine instructions and memory addresses.

Some processors have fairly regular operations, and as a result assembly code for them can be fairly natural and not too hard to understand. Other processors, in particular Intel's x86 line, have, let us charitably say, more interesting instructions, with certain registers used for certain things.

My laptop has one of these latter processors (Pentium 4), so my gcc compiler produces code that from a pedagogical viewpoint is less than ideal. If you have a Mac with a PPC processor (the newest Macs are x86), your assembly language is cleaner. NYU's ACF features Sun computers with SPARC processors, which also have regular instruction sets.
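The macro expansion a preprocessor performs can be illustrated with a minimal sketch. This handles only C-style object-like "#define NAME value" macros (no #include, no conditionals, no function-like macros); the function name and input are invented for the illustration.

```python
# A minimal sketch of the macro-expansion part of a preprocessor pass:
# it records object-like "#define NAME value" macros and substitutes
# them (whole-word matches only) into every subsequent line.
import re

def preprocess(source: str) -> str:
    macros = {}
    output = []
    for line in source.splitlines():
        m = re.match(r"\s*#define\s+(\w+)\s+(.*)", line)
        if m:
            macros[m.group(1)] = m.group(2).strip()
            continue  # directive lines do not appear in the output
        # replace whole-word occurrences of each known macro name
        for name, value in macros.items():
            line = re.sub(r"\b%s\b" % re.escape(name), value, line)
        output.append(line)
    return "\n".join(output)

src = "#define MAX 100\nint a[MAX];"
print(preprocess(src))  # -> int a[100];
```

A real preprocessor such as cpp is far more involved (rescanning, token pasting, conditional compilation), but the record-then-substitute structure is the same.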

Compilation tool chain - part 2

Two-pass assembly

No matter what the assembly language is, an assembler needs to assign memory locations to symbols (called identifiers) and use the numeric location address in the target machine language produced. Of course, the same address must be used for all occurrences of a given identifier, and two different identifiers must (normally) be assigned two different locations.

The conceptually simplest way to accomplish this is to make two passes over the input (read it once, then read it again from the beginning). During the first pass, each time a new identifier is encountered, an address is assigned and the pair (identifier, address) is stored in a symbol table. During the second pass, whenever an identifier is encountered, its address is looked up in the symbol table and this value is used in the generated machine instruction.

Linkers

Linkers, a.k.a. linkage editors, combine the output of the assembler for several different compilations. That is, the horizontal line of the diagram above should really be a collection of lines converging on the linker. The linker has another input, namely libraries, but to the linker the libraries look like other programs compiled and assembled. The two primary tasks of the linker are:

1. Relocating relative addresses.
2. Resolving external references (such as the procedure xor() above).
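The two-pass symbol-table idea described above can be sketched for a toy assembly language in which a line is either "LABEL:" (defining a symbol at the current address) or an instruction that may name a label. The instruction names and the one-word-per-instruction layout are assumptions made for the illustration.

```python
# A sketch of two-pass assembly: pass 1 builds the symbol table,
# pass 2 substitutes numeric addresses for label operands.  Because
# the table is complete before pass 2 starts, forward references work.
def two_pass_assemble(lines):
    # Pass 1: assign an address to every label.
    symtab, addr = {}, 0
    for line in lines:
        if line.endswith(":"):
            symtab[line[:-1]] = addr   # label marks the next instruction
        else:
            addr += 1                  # each instruction occupies one word
    # Pass 2: replace label operands with their numeric addresses.
    code = []
    for line in lines:
        if line.endswith(":"):
            continue
        op, *operands = line.split()
        operands = [str(symtab.get(o, o)) for o in operands]
        code.append(" ".join([op] + operands))
    return symtab, code

symtab, code = two_pass_assemble(["start:", "LOAD x", "JMP start", "x:", "HALT"])
# symtab: {'start': 0, 'x': 2}; code: ['LOAD 2', 'JMP 0', 'HALT']
```

Note that "LOAD x" refers to the label x, which is defined later in the file; a single-pass assembler would have to backpatch such references instead.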

Compilation tool chain - part 3

Relocating relative addresses

The assembler processes one file at a time. Thus the symbol table produced while processing file A is independent of the symbols defined in file B, and conversely. Thus, it is likely that the same address will be used for different symbols in each program. The technical term is that the (local) addresses in the symbol table for file A are relative to file A; they must be relocated by the linker. This is accomplished by adding the starting address of file A (which in turn is the sum of the lengths of all the files processed previously in this run) to the relative address.

Resolving external references

Assume procedure f, in file A, and procedure g, in file B, are compiled (and assembled) separately. Assume also that f invokes g. Since the compiler and assembler do not see g when processing f, it appears impossible for procedure f to know where in memory to find g.

The solution is for the compiler to indicate in the output of the file A compilation that the address of g is needed. This is called a use of g. When processing file B, the compiler outputs the (relative) address of g. This is called the definition of g. The assembler passes this information to the linker.

The simplest linker technique is to again make two passes. During the first pass, the linker records in its "external symbol table" (a table of external symbols, not a symbol table that is stored externally) all the definitions encountered. During the second pass, every use can be resolved by access to the table.
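Both linker tasks, relocation and external-reference resolution, can be sketched together. Each "object file" here records its code length, its locally defined symbols (with relative addresses), and its uses of external symbols; this object-file format is invented for the illustration.

```python
# A sketch of a two-pass linker: pass 1 relocates every definition by
# its file's base address and records it in the external symbol table;
# pass 2 resolves every use against that table.
def link(files):
    # Pass 1: assign each file a base address (the sum of the lengths
    # of all files processed before it) and relocate its definitions.
    base, bases, extern = 0, [], {}
    for f in files:
        bases.append(base)
        for sym, rel in f["defs"].items():
            extern[sym] = base + rel   # relocate the relative address
        base += f["length"]
    # Pass 2: resolve every use via the external symbol table.
    resolved = {}
    for f in files:
        for sym in f["uses"]:
            resolved[sym] = extern[sym]
    return extern, resolved

file_a = {"length": 100, "defs": {"f": 0},  "uses": ["g"]}   # f invokes g
file_b = {"length": 50,  "defs": {"g": 10}, "uses": []}
extern, resolved = link([file_a, file_b])
# g is at relative address 10 in file B, and file B starts at 100
# (the length of file A), so the use of g resolves to address 110.
```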

Compilation tool chain - part 4

Loaders

After the linker has done its work, the resulting "executable file" can be loaded by the operating system into central memory. The details are OS-dependent. With early single-user operating systems, all programs would be loaded at a fixed address (say 0) and the loader would simply copy the file to memory. Today it is much more complicated, since (parts of) many programs reside in memory at the same time. Hence the compiler/assembler/linker cannot know the real location of an identifier. Indeed, this real location can change.

Structure of Compiler

The Structure of a Compiler

Modern compilers contain two (large) parts, each of which is often subdivided. These two parts are the front end, shown in green on the right, and the back end, shown in pink.

The front end analyzes the source program, determines its constituent parts, and constructs an intermediate representation of the program. Typically the front end is independent of the target language.

The back end synthesizes the target program from the intermediate representation produced by the front end. Typically the back end is independent of the source language.

This front/back division very much reduces the work for a compiling system that can handle several (N) source languages and several (M) target languages. Instead of N×M compilers, we need only N front ends and M back ends. For gcc (originally standing for GNU C Compiler, but now standing for GNU Compiler Collection), N=7 and M≈30, so the savings are considerable.

Multiple Phases

The front end and back end are themselves each divided into multiple phases. The input to each phase is the output of the previous one. Sometimes a phase changes the representation of the input. For example, the lexical analyzer converts a character stream input into a token stream output. Sometimes the representation is unchanged. For example, the machine-dependent optimizer transforms target-machine code into (hopefully improved) target-machine code.

The diagram is definitely not drawn to scale, in terms of effort or lines of code. In practice the optimizers, especially the machine-dependent one, dominate.

Conceptually, there are three phases of analysis, with the output of one phase the input of the next. The phases are called lexical analysis or scanning, syntax analysis or parsing, and semantic analysis.

Lexical Analysis in Compiler Design

Lexical Analysis (or Scanning)

The character stream input is grouped into meaningful units called lexemes, which are then mapped into tokens, the latter constituting the output of the lexical analyzer. For example, any one of the following

x3 = y + 3;
x3 = y + 3 ;
x3 =y+ 3 ;

but not

x 3 = y + 3;

would be grouped into the lexemes x3, =, y, +, 3, and ;.

A token is a <token-name,attribute-value> pair. For example:

- The lexeme x3 would be mapped to a token such as <id,1>. The name id is short for identifier. The value 1 is the index of the entry for x3 in the symbol table produced by the compiler. This table is used to pass information to subsequent phases.
- The lexeme = would be mapped to the token <=>. In reality it is probably mapped to a pair whose second component is ignored. The point is that there are many different identifiers, so we need the second component, but there is only one assignment symbol =.
- The lexeme y is mapped to the token <id,2>.
- The lexeme + is mapped to the token <+>.
- The lexeme 3 is somewhat interesting and is discussed further in subsequent chapters. It is mapped to <number,something>, but what is the something? On the one hand there is only one 3, so we could just use the token <number,3>. However, there can be a difference between how this should be printed (e.g., in an error message produced by subsequent phases) and how it should be stored (fixed vs. float vs. double). Perhaps the token should point to the symbol table where an entry for this kind of 3 is stored. Another possibility is to have a separate numbers table.
- The lexeme ; is mapped to the token <;>.

Note that non-significant blanks are normally removed during scanning. In C, most blanks are non-significant. Blanks inside strings are an exception.

Note that we can define identifiers, numbers, and the various symbols and punctuation without using recursion.
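The grouping described above can be sketched as a small scanner for the running example. The token names (id, number, and the literal symbols) follow the text; the regular-expression table and the token-tuple representation are assumptions of this sketch.

```python
# A sketch of a scanner: try each pattern at the current position,
# emit a token for the longest-known match, and build the symbol
# table mapping each identifier to a small integer index.
import re

TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id",     r"[A-Za-z_]\w*"),
    ("sym",    r"[=+;]"),
    ("skip",   r"\s+"),
]

def scan(text):
    symtab, tokens, pos = {}, [], 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                lexeme = m.group()
                pos += len(lexeme)
                if name == "skip":
                    pass                       # drop non-significant blanks
                elif name == "id":
                    idx = symtab.setdefault(lexeme, len(symtab) + 1)
                    tokens.append(("id", idx))
                elif name == "number":
                    tokens.append(("number", lexeme))
                else:
                    tokens.append((lexeme,))   # one-of-a-kind symbols
                break
        else:
            raise SyntaxError("bad character: " + text[pos])
    return tokens, symtab

tokens, symtab = scan("x3 = y + 3;")
# tokens: [('id', 1), ('=',), ('id', 2), ('+',), ('number', '3'), (';',)]
# symtab: {'x3': 1, 'y': 2}
```

Note that each pattern is non-recursive, matching the observation above that lexemes can be defined without recursion, and that "x3 =y+ 3 ;" yields exactly the same token stream.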

Syntax Analysis in Compiler Design

Syntax Analysis (or Parsing)

Parsing involves a further grouping in which tokens are grouped into grammatical phrases, which are often represented in a parse tree. For example, x3 = y + 3; would be parsed into the tree on the right.

This parsing would result from a grammar containing rules such as

asst-stmt → id = expr ;
expr → number | id | expr + expr

Note the recursive definition of expression (expr). Note also the hierarchical decomposition in the figure on the right.

The division between scanning and parsing is somewhat arbitrary, but invariably if a recursive definition is involved, it is considered parsing, not scanning.

Often we utilize a simpler tree called the syntax tree, with operators as interior nodes and operands as the children of the operator. The syntax tree on the right corresponds to the parse tree above it.

(Technical point.) The syntax tree represents an assignment expression, not an assignment statement. In C an assignment statement includes the trailing semicolon. That is, in C (unlike in Algol) the semicolon is a statement terminator, not a statement separator.
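A recursive-descent parser for this grammar can be sketched as follows, restricted so that expr is a left-associative sum of ids and numbers. It builds the syntax tree (operators as interior nodes) rather than the full parse tree; the token format, with lexemes kept in the tokens for readability, is an assumption of this sketch.

```python
# A sketch of recursive-descent parsing for:
#   asst-stmt -> id = expr ;
#   expr      -> number | id | expr + expr
def parse_asst(tokens):
    pos = 0

    def expect(kind):
        nonlocal pos
        tok = tokens[pos]
        assert tok[0] == kind, "expected " + kind
        pos += 1
        return tok

    def parse_expr():
        # The left-recursive rule expr -> expr + expr is handled
        # iteratively, which also makes + left-associative.
        nonlocal pos
        left = tokens[pos]; pos += 1          # a number or id token
        while pos < len(tokens) and tokens[pos][0] == "+":
            pos += 1
            right = tokens[pos]; pos += 1
            left = ("+", left, right)         # operator as interior node
        return left

    target = expect("id")
    expect("=")
    tree = ("=", target, parse_expr())
    expect(";")                               # the statement terminator
    return tree

tree = parse_asst([("id", "x3"), ("=",), ("id", "y"),
                   ("+",), ("number", "3"), (";",)])
# tree: ('=', ('id', 'x3'), ('+', ('id', 'y'), ('number', '3')))
```

The recursion in parse_expr (conceptually, expr calling itself) is exactly what distinguishes this phase from scanning.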

Semantic Analysis in Compiler Design

Semantic Analysis

There is more to a front end than simply syntax. The compiler needs semantic information, e.g., the types (integer, real, pointer to array of integers, etc.) of the objects involved. This enables checking for semantic errors and inserting type conversions where necessary.

For example, if y was declared to be a real and x3 an integer, we need to insert (unary, i.e., one-operand) conversion operators "inttoreal" and "realtoint" as shown on the right.
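The insertion of conversion operators can be sketched as a type-checking walk over the syntax tree. The declared types and the tree shape follow the running example (x3 an integer, y a real); the tuple-based tree representation and the rule "convert the int side of a mixed sum" are assumptions of this sketch.

```python
# A sketch of semantic analysis: walk the tree bottom-up, compute the
# type of each node, and splice in "inttoreal"/"realtoint" nodes
# wherever the types disagree.
decls = {"x3": "int", "y": "real"}   # declared types of the identifiers

def annotate(node):
    """Return (typed_tree, type_of_node)."""
    kind = node[0]
    if kind == "number":
        return node, "int"           # the literal 3 is an integer
    if kind == "id":
        return node, decls[node[1]]
    if kind == "+":
        left,  lt = annotate(node[1])
        right, rt = annotate(node[2])
        if lt != rt:                 # mixed int/real: convert the int side
            if lt == "int":
                left = ("inttoreal", left)
            else:
                right = ("inttoreal", right)
        return ("+", left, right), ("real" if "real" in (lt, rt) else "int")
    if kind == "=":
        target_type = decls[node[1][1]]
        rhs, rt = annotate(node[2])
        if rt == "real" and target_type == "int":
            rhs = ("realtoint", rhs)   # narrow back for the int target
        return ("=", node[1], rhs), target_type
    raise ValueError("unknown node kind: " + kind)

tree = ("=", ("id", "x3"), ("+", ("id", "y"), ("number", "3")))
typed, _ = annotate(tree)
# typed: ('=', ('id', 'x3'),
#         ('realtoint', ('+', ('id', 'y'), ('inttoreal', ('number', '3')))))
```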

Intermediate Code Generation in Compiler Design

Intermediate code generation

Many compilers first generate code for an "idealized machine". For example, the intermediate code generated would assume that the target has an unlimited number of registers and that any register can be used for any operation. Another common assumption is that machine operations take (up to) three operands: two sources and one target.

With these assumptions one generates three-address code by walking the semantic tree. Our example C instruction would produce

temp1 = inttoreal(3)
temp2 = id2 + temp1
temp3 = realtoint(temp2)
id1 = temp3

We see that three-address code can include instructions with fewer than three operands.

Sometimes three-address code is called quadruples because one can view the previous code sequence as

inttoreal temp1 3     --
add       temp2 id2   temp1
realtoint temp3 temp2 --
assign    id1   temp3 --

Each "quad" has the form: operation target source1 source2.
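The walk that produces these quads can be sketched as a postorder traversal of the semantic tree, creating a new temporary for each interior node. The quad format is (operation, target, source1, source2), with "--" for a missing operand as in the text; the source names x3 and y are used where the text writes id1 and id2.

```python
# A sketch of three-address code generation: emit the quads for a
# node's children first, then one quad for the node itself, returning
# the name that holds the node's value.
counter = 0
def new_temp():
    global counter
    counter += 1
    return "temp" + str(counter)

def gen(node, quads):
    kind = node[0]
    if kind in ("id", "number"):
        return str(node[1])                    # leaves emit no code
    if kind in ("inttoreal", "realtoint"):
        src = gen(node[1], quads)
        t = new_temp()
        quads.append((kind, t, src, "--"))     # unary: one source operand
        return t
    if kind == "+":
        s1, s2 = gen(node[1], quads), gen(node[2], quads)
        t = new_temp()
        quads.append(("add", t, s1, s2))
        return t
    if kind == "=":
        src = gen(node[2], quads)
        quads.append(("assign", node[1][1], src, "--"))
        return node[1][1]
    raise ValueError("unknown node kind: " + kind)

typed = ("=", ("id", "x3"),
         ("realtoint", ("+", ("id", "y"), ("inttoreal", ("number", "3")))))
quads = []
gen(typed, quads)
# quads:
#   ('inttoreal', 'temp1', '3',     '--')
#   ('add',       'temp2', 'y',     'temp1')
#   ('realtoint', 'temp3', 'temp2', '--')
#   ('assign',    'x3',    'temp3', '--')
```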

Code Optimization and Code Generation

Code optimization

This is a very serious subject, one that we will not really do justice to in this introductory course. Some optimizations are fairly easy to see.

Since 3 is a constant, the compiler can perform the int-to-real conversion at compile time and replace the first two quads with

add temp2 id2 3.0

The last two quads can be combined into

realtoint id1 temp2

Code generation

Modern processors have only a limited number of registers. Although some processors, such as the x86, can perform operations directly on memory locations, we will for now assume only register operations. Some processors (e.g., the MIPS architecture) use three-address instructions. However, some processors permit only two addresses; the result overwrites one of the sources.

With these assumptions, code something like the following would be produced for our example, after first assigning memory locations to id1 and id2:

LD    R1, id2
ADDF  R1, R1, #3.0   // add float
RTOI  R2, R1         // real to int
ST    id1, R2
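The first of these optimizations, folding the constant conversion, can be sketched directly on the quad list. The quad shape matches the (operation, target, source1, source2) form above; the function name and the "string ending in .0 means a real constant" convention are assumptions of this sketch.

```python
# A sketch of constant folding for inttoreal: when the source is a
# constant, do the conversion at compile time, delete the quad, and
# substitute the folded value into every later use of its target.
def fold_inttoreal(quads):
    known, out = {}, []
    for op, target, s1, s2 in quads:
        s1 = known.get(s1, s1)          # substitute folded constants
        s2 = known.get(s2, s2)
        if op == "inttoreal" and s1.replace("-", "").isdigit():
            known[target] = s1 + ".0"   # convert the constant now
            continue                    # the quad itself disappears
        out.append((op, target, s1, s2))
    return out

quads = [("inttoreal", "temp1", "3", "--"),
         ("add", "temp2", "id2", "temp1"),
         ("realtoint", "temp3", "temp2", "--"),
         ("assign", "id1", "temp3", "--")]
print(fold_inttoreal(quads))
# [('add', 'temp2', 'id2', '3.0'),
#  ('realtoint', 'temp3', 'temp2', '--'),
#  ('assign', 'id1', 'temp3', '--')]
```

The second optimization, merging the trailing realtoint/assign pair into "realtoint id1 temp2", would be a separate copy-propagation style pass and is left out of this sketch.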