Saturday, June 16, 2007

Compilation basics

Introduction:-

Compiling refers to the act of turning human-readable source code into machine-readable binary code, and a compiler is the tool that makes this happen.
In other words, a compiler is a program that translates a program written in a language we understand into a language the computer understands.

A more technical definition of compilation is "Translation of a program written in a source language into a semantically equivalent program written in a target language".

Studying compilers is important for two main reasons:

1. Knowledge of how compilers work can be used to write more efficient code in a high-level language.

2. It makes us more effective users of compilers, rather than treating the compiler as a "black box".


Code Translation Process

[Figure: source code → preprocessor → compiler → assembler → linker → loader → running program]

Preprocessor:-

· Preprocessing

– Occurs before a program is compiled
– Inclusion of other files
– Definition of symbolic constants and macros
– Conditional compilation of program code
– Conditional execution of preprocessor directives

· Format of preprocessor directives

– Lines begin with #
– Only whitespace characters before directives on a line
– Example constant: #define PI 3.14
– Example macro: #define CIRCLE_AREA( x ) ( PI * ( x ) * ( x ) )

· Conditional compilation

– Control preprocessor directives and compilation
– Structure similar to if
#if !defined( NULL )
#define NULL 0
#endif

Note: Conditional compilation can be used as an on/off switch for debugging, as well as to influence the debugging information that is produced.

Compiler:-


Phases in a compiler

1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
4. Code Optimization
5. Code Generation

Lexical Analysis:

Lexical analysis is the process of converting a sequence of characters into a sequence of tokens. Programs performing lexical analysis are called lexical analyzers or lexers. A lexer is built around a scanner, which groups input characters into tokens that are handed to the parser on demand.

• Each time the parser needs a token, it sends a request to the scanner.
• The scanner reads as many characters from the input stream as necessary to construct a single token.
• When a single token is formed, the scanner is suspended and returns the token to the parser.
• The parser repeatedly calls the scanner until all the tokens have been read from the input stream.

Scanner:

• A typical scanner:

– recognizes special characters, such as ( and ), or groups of special characters, such as := and ==
– recognizes identifiers, integers, reals, strings, etc.
– ignores whitespace (tabs, blanks, etc.) and comments
– recognizes and processes special directives (such as the #include "file" directive in C) and macros

Parser:

Parsing (more formally, syntactic analysis) is the process of analyzing a sequence of tokens to determine its grammatical structure with respect to a given formal grammar. A parser is the component of a compiler that carries out this task.

Example:

The expression is transformed into a structured representation (a parse tree), which can then be used to evaluate or manipulate the expression.

Semantic analysis:


Semantic analysis is the phase in which the compiler adds semantic information to the parse tree and builds the symbol table. This phase performs semantic checks such as type checking (checking for type errors), object binding (associating variable and function references with their definitions), and definite assignment (requiring all local variables to be initialized before use), rejecting incorrect programs or issuing warnings.

Code Optimization:

Code optimization is the process of tuning the output of a compiler to minimize some attribute (or maximize the efficiency) of an executable program. The most common requirement is to minimize the time taken to execute a program; a less common one is to minimize the amount of memory occupied.

Example:
In the expression "(a+b)-(a+b)/4", "common subexpression" refers to the duplicated "(a+b)". Compilers implementing this technique recognize that "(a+b)" won't change, and so calculate its value only once.

Code generation:

Code generation is the process by which a compiler's code generator converts a syntactically correct program into a linear sequence of instructions that can be executed by a machine.

Assembler:

An assembler is used to translate assembly language statements into the target computer's machine code. The assembler performs a more or less isomorphic translation (a one-to-one mapping) from mnemonic statements into machine instructions and data. (This is in contrast with high-level languages, in which a single statement generally results in many machine instructions.)

The assembler's input is assembly code (mnemonics such as push and pop), and its output is the target object code.

Linker:

Linking is the process of combining various pieces of code and data together to form a single executable that can be loaded in memory.
Link editors are commonly known as linkers. The compiler automatically invokes the linker as the last step in compiling a program. The linker inserts code (or maps in shared libraries) to resolve program library references, and/or combines object modules into an executable image suitable for loading into memory. On Unix-like systems, the linker is typically invoked with the ld command.

Dynamic linking is accomplished by placing the name of a sharable library in the executable image. Actual linking with the library routines does not occur until the image is run, when both the executable and the library are placed in memory. An advantage of dynamic linking is that multiple programs can share a single copy of the library.

Static linking is the result of the linker copying all library routines used in the program into the executable image. This may require more disk space and memory than dynamic linking, but can be more portable, since it does not require the presence of the library on the system where it is run.

Static linking versus Dynamic Linking

--Dynamic linking produces smaller executable files.
--Dynamic linking consumes less memory.
--Dynamic linking runs more slowly.
--Dynamic linking makes more system calls.
--Dynamic linking needs more loading time.

Loader:

A loader copies a program from secondary (disk) storage to main memory so it is ready to run. In some cases loading just involves copying the data from disk to memory; in others it involves allocating storage, setting protection bits, or arranging for virtual memory to map virtual addresses to disk pages.

Types of compilers

Compilers are broadly classified into two types:
· One-pass compiler
· Multi-pass compiler

A one-pass compiler is a type of compiler that passes through the source code of each compilation unit only once. In a sense, a one-pass compiler can't 'look back' at code it previously processed. Another term sometimes used is narrow compiler, which emphasizes the limited scope a one-pass compiler is obliged to use. This is in contrast to a multi-pass compiler, which traverses the source code and/or the abstract syntax tree several times, building one or more intermediate representations that can be arbitrarily refined.

Summary:

The process of compilation is divided into various stages, with each phase meant to do a particular job. Dividing compilation into phases not only gives compilers the advantage of modularity but also makes them easier to maintain.