Lexical Analysis and Token Recognition, Slides of Compiler Design

This content covers the following topics: role of the lexical analyzer, specification of tokens, recognition of tokens, lexical analyzer generator, and design of a lexical analyzer generator.

Typology: Slides

2021/2022

Available from 09/23/2023

avishek-1 🇮🇳


Lexical Analysis

Overview

  • Role of the lexical analyzer
  • Specification of tokens
  • Recognition of tokens
  • Lexical analyzer generator
  • Design of a lexical analyzer generator

Why separate lexical analysis and parsing?

  1. Simplicity of design
  2. Improving compiler efficiency
  3. Enhancing compiler portability

Tokens, Patterns and Lexemes

A token is a pair: a token name and an optional token value. A pattern is a description of the form that the lexemes of a token may take. A lexeme is a sequence of characters in the source program that matches the pattern for a token.

Example: the input 31 + 28 + 59 is transformed into the sequence <num, 31> <+> <num, 28> <+> <num, 59>
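The transformation above can be sketched as a minimal pattern-directed tokenizer. This is an illustrative sketch, not code from the slides; the token names num and plus and the patterns are assumptions chosen to match the example input.

```python
import re

# Illustrative (token name, pattern) pairs for the input "31 + 28 + 59".
TOKEN_PATTERNS = [
    ("num", r"\d+"),   # pattern: one or more digits
    ("plus", r"\+"),   # pattern: the literal +
    ("ws", r"\s+"),    # whitespace: matched but not returned as a token
]

def tokenize(source):
    """Return (token name, optional token value) pairs for each lexeme."""
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_PATTERNS:
            m = re.match(pattern, source[pos:])
            if m:
                lexeme = m.group(0)
                if name == "num":
                    tokens.append(("num", int(lexeme)))  # value carried along
                elif name != "ws":
                    tokens.append((name, None))          # no value needed
                pos += len(lexeme)
                break
        else:
            raise SyntaxError(f"no pattern matches at position {pos}")
    return tokens

print(tokenize("31 + 28 + 59"))
# [('num', 31), ('plus', None), ('num', 28), ('plus', None), ('num', 59)]
```

Each output pair corresponds to one <token, value> element of the sequence shown above.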

Lexical errors

Some errors are beyond the power of the lexical analyzer to recognize, because fi is a valid lexeme (an identifier):

fi (a == f(x)) …

However, it may be able to recognize errors like:

d = 2r

Such errors are recognized when no pattern for tokens matches a character sequence.

Error recovery

Panic mode: successive characters are ignored until we reach a well-formed token.

Other recovery actions transform the remaining input:

  • Delete one character from the remaining input
  • Insert a missing character into the remaining input
  • Replace a character by another character
  • Transpose two adjacent characters

(For example, the strings “ALISA” and “ALYSSA” are separated by an edit distance of 2: two such transformations turn one string into the other.)
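The edit distance mentioned above can be computed with the standard dynamic-programming recurrence; this sketch counts insertions, deletions, and replacements (Levenshtein distance), which suffices for the ALISA/ALYSSA example.

```python
def edit_distance(s, t):
    """Minimum number of single-character insertions, deletions,
    and replacements needed to transform s into t."""
    prev = list(range(len(t) + 1))          # distances from "" prefix of s
    for i, cs in enumerate(s, 1):
        cur = [i]                           # deleting i characters of s
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            cur.append(min(prev[j] + 1,         # delete cs
                           cur[j - 1] + 1,      # insert ct
                           prev[j - 1] + cost)) # replace (or keep) cs
        prev = cur
    return prev[-1]

print(edit_distance("ALISA", "ALYSSA"))  # 2: replace I with Y, insert S
```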

Input buffering

In the C language we need to look at the character after -, =, or < to decide what token to return, since it could be the start of a two-character operator like ==. There are many situations where we need to look at least one additional character ahead.

We introduce a two-buffer scheme, with the buffers alternately reloaded, to handle large lookaheads safely.

(Buffer illustration: E = M * C * * 2 eof, with the lexemeBegin and forward pointers into the buffer.)

Pointers in Input Buffer

Two pointers to the input are maintained:

1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found. Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found.
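The two-pointer discipline can be sketched on a single in-memory buffer (the double-buffer reloading is omitted here; the scanning rules for identifiers, numbers, and single-character operators are simplified assumptions for illustration):

```python
def scan_lexemes(buffer):
    """Sketch of the two-pointer scheme: lexeme_begin marks the start of
    the current lexeme; forward scans ahead until the lexeme's extent is
    known, then lexeme_begin jumps past the recorded lexeme."""
    lexemes = []
    lexeme_begin = 0
    while lexeme_begin < len(buffer):
        if buffer[lexeme_begin].isspace():
            lexeme_begin += 1              # skip whitespace between lexemes
            continue
        forward = lexeme_begin
        if buffer[forward].isalpha():
            while forward < len(buffer) and buffer[forward].isalnum():
                forward += 1               # scan an identifier
        elif buffer[forward].isdigit():
            while forward < len(buffer) and buffer[forward].isdigit():
                forward += 1               # scan a number
        else:
            forward += 1                   # single-character operator
        lexemes.append(buffer[lexeme_begin:forward])   # record the lexeme
        lexeme_begin = forward             # continue just past it
    return lexemes

print(scan_lexemes("E = M * C ** 2"))
# ['E', '=', 'M', '*', 'C', '*', '*', '2']
```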

Specification of tokens

In the theory of compilation, regular expressions are used to formalize the specification of tokens. Regular expressions are a means for specifying regular languages. Each regular expression is a pattern specifying the form of a set of strings. Examples:

  • x* : zero or more strings matching x
  • a+ : one or more strings matching a
  • . : any character but newline, as in a.*b
  • id -> letter_ (letter_ | digit)*

Regular definitions

d1 -> r1

d2 -> r2

…

dn -> rn

 Example:

letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*

Here letter_ denotes a letter or an underscore, and juxtaposition denotes concatenation. Ref. table 3.8 in the text book.

Recognition of tokens

The starting point is the language grammar, from which we understand the tokens:

stmt -> if expr then stmt
      | if expr then stmt else stmt
      | Ɛ
expr -> term relop term | term
term -> id | number

Recognition of tokens (cont.)

The next step is to formalize the patterns:

digit → [0-9] (i.e., 0 | 1 | … | 9)
digits → digit+
number → digits (. digits)? (E [+-]? digits)? (ex: 2340, 0.012, 2.32E4 or 1.89E-4)
letter → [A-Za-z_]
id → letter (letter | digit)*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>

We also need to handle whitespace:

ws → (blank | tab | newline)+
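The patterns above can be transcribed into a single alternation of named regular expressions. This is a sketch, not a full lexer (no symbol table, no error recovery); the token names and the keyword handling are illustrative assumptions.

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<ws>[ \t\n]+)                            # ws -> (blank|tab|newline)+
  | (?P<number>\d+(?:\.\d+)?(?:E[+-]?\d+)?)     # number -> digits(.digits)?(E[+-]?digits)?
  | (?P<id>[A-Za-z_][A-Za-z_0-9]*)              # id -> letter (letter|digit)*
  | (?P<relop><=|>=|<>|<|>|=)                   # longest alternatives first
""", re.VERBOSE)

KEYWORDS = {"if", "then", "else"}      # reserved words from the grammar

def tokens(source):
    out, pos = [], 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if not m:
            raise SyntaxError(f"no token pattern matches at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        if kind == "id" and lexeme in KEYWORDS:
            out.append((lexeme, None))           # keywords are their own tokens
        elif kind != "ws":
            out.append((kind, lexeme))           # whitespace is discarded
        pos = m.end()
    return out

print(tokens("if x1 <= 2.32E4 then y else z"))
# [('if', None), ('id', 'x1'), ('relop', '<='), ('number', '2.32E4'),
#  ('then', None), ('id', 'y'), ('else', None), ('id', 'z')]
```

Note that <= is listed before < in the relop alternation so the longer operator wins, mirroring the lookahead issue discussed under input buffering.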

Transition diagrams

Transition diagram for relop

Notation: a single circle is a state, an edge is a transition, and a double circle is an accepting state, indicating detection of a lexeme. An accepting state marked retract means the forward pointer must be moved back one character, since the last character read is not part of the lexeme. Note: the start state is entered by an edge from nowhere.
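The relop transition diagram can be encoded directly as branching on the current character, with retraction expressed by how far the position advances. The token names (LT, LE, NE, EQ, GT, GE) are illustrative.

```python
def relop(source, pos=0):
    """Encode the relop transition diagram: return (token, next_position)
    for a relational operator starting at pos, or None if none starts there."""
    def peek(i):
        return source[i] if i < len(source) else ""   # "" stands for end of input
    c = peek(pos)
    if c == "<":
        n = peek(pos + 1)
        if n == "=":
            return ("LE", pos + 2)
        if n == ">":
            return ("NE", pos + 2)
        return ("LT", pos + 1)   # other character: retract, consume only "<"
    if c == "=":
        return ("EQ", pos + 1)
    if c == ">":
        if peek(pos + 1) == "=":
            return ("GE", pos + 2)
        return ("GT", pos + 1)   # other character: retract, consume only ">"
    return None

print(relop("<="))   # ('LE', 2)
print(relop("<x"))   # ('LT', 1)  -- the retract case
```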

Transition diagrams (cont.)

Transition diagram for reserved words and identifiers.

Hypothetical transition diagram for the keyword then:

start → 0 --t--> 1 --h--> 2 --e--> 3 --n--> 4 (accepting)
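Rather than building one such diagram per keyword, a common alternative is a single identifier diagram plus a table of reserved words: scan a letter (letter | digit)* lexeme, then look it up. The table contents and token names below are illustrative assumptions.

```python
RESERVED = {"if": "IF", "then": "THEN", "else": "ELSE"}   # illustrative table

def id_or_keyword(source, pos=0):
    """Scan an identifier-shaped lexeme starting at pos and classify it;
    return (token, lexeme, next_position), or None if none starts there."""
    if pos >= len(source) or not (source[pos].isalpha() or source[pos] == "_"):
        return None
    end = pos
    while end < len(source) and (source[end].isalnum() or source[end] == "_"):
        end += 1                               # extend over letters and digits
    lexeme = source[pos:end]
    token = RESERVED.get(lexeme, "ID")         # keyword if reserved, else identifier
    return (token, lexeme, end)

print(id_or_keyword("then x"))    # ('THEN', 'then', 4)
print(id_or_keyword("then2 x"))   # ('ID', 'then2', 5) -- not a keyword
```

This keeps the diagrams small: one diagram recognizes all identifiers, and the table lookup distinguishes then from an identifier such as then2.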