Lexical Analysis
BNF Grammars for Tokens
- Lexical analysis - purpose is to recognize the tokens of a language
- Tokens are words of a language
- Distinguish between
- Tokens - categories of words
- Lexemes - text of words
Regular Grammars
- Tokens can be described using Regular Grammars
- In a regular grammar, the RHS of every production is either a single terminal or a terminal followed by a variable
- Example Regular Grammar:
N -> a N
-> a
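The productions N -> a N | a generate one or more a's, i.e., the same language as the regular expression a+. A quick sketch (function name is illustrative):

```python
import re

# The regular grammar  N -> a N | a  generates one or more 'a's,
# which is exactly the language of the regular expression a+.
def in_language_N(s):
    return re.fullmatch(r"a+", s) is not None
```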
EBNF Grammars for Tokens
- EBNF Grammar for tokens
- Defines the same language as above, with braces { } expressing repetition instead of recursion
<ident> --> <letter> {<character>}
<letter> --> a | b | ... | z | A | B | ... | Z
<character> --> <letter> | <digit> | _
<digit> --> 0 | 1 | 2 | ... | 9
<number> --> <digit> {<digit>}
<symbol> --> + | - | * | /
| ( | ) | := | ;
| < | <= | > | >= | = | /=
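The EBNF rules above translate directly into regular expressions; a sketch in Python (the names IDENT, NUMBER, and SYMBOL are illustrative, not from the notes):

```python
import re

# Translation of the EBNF token rules into Python regexes (a sketch).
IDENT  = re.compile(r"[A-Za-z][A-Za-z0-9_]*")   # <letter> {<character>}
NUMBER = re.compile(r"[0-9]+")                  # <digit> {<digit>}
# two-character symbols come first so they win over their one-char prefixes
SYMBOL = re.compile(r"<=|>=|:=|/=|[-+*/();<>=]")
```

Listing `<=` before `<` in the alternation matters: Python's `re` tries alternatives left to right, so the longer lexeme is matched first.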
Scanners
- Scanners
- Repeatedly called by the parser, returning the next token on each call
- AKA Lexical analyzer
- Tokens described by regular grammars
- Simplicity of regular grammar makes scanners simpler than parsers
- Scanner actions are easier to think about if the grammar is
transformed into a finite state machine (FSM)
- Scanners are based on finite state machines
- Scanners can be hand coded or table driven (we look at both)
Finite State Machines
- Used to describe a program that can be in a set of states
- State gives an indication of what's been seen so far in the
input
- Program responds to input differently based on what state it
is in.
- Response includes an action (e.g., output) and a transition to the next state
- Finite state machines are made up of
- Set of States
- Transitions between states that are caused by input
- Each transition can have associated outputs and/or actions
- Graphical representation is common
- States are ovals
- Arcs are transitions between states caused by inputs
- Arcs are labeled with input (and perhaps output)
that causes the transition
- Examples: Identifiers, numbers, other symbols
Hand Coded Implementation of FSM
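A minimal hand-coded sketch in Python, where the FSM states are implicit in the control flow (each loop plays the role of one state; the function and token names are illustrative, not from the notes):

```python
def scan(text, pos):
    """Return (token, lexeme, new_pos); a hand-coded FSM sketch."""
    while pos < len(text) and text[pos].isspace():   # skip whitespace
        pos += 1
    if pos == len(text):
        return ("EOF", "", pos)
    start = pos
    if text[pos].isalpha():                          # START --letter--> ID
        while pos < len(text) and (text[pos].isalnum() or text[pos] == "_"):
            pos += 1                                 # ID loops on letter/digit/_
        return ("ID", text[start:pos], pos)
    if text[pos].isdigit():                          # START --digit--> INT
        while pos < len(text) and text[pos].isdigit():
            pos += 1                                 # INT loops on digit
        return ("INT", text[start:pos], pos)
    return ("UNKNOWN", text[pos], pos + 1)           # any other character
```

The parser would call `scan` repeatedly, passing back the returned position each time.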
Table Driven Scanner
- A table-driven scanner encodes the FSM states and actions into a
table:
- Rows: states
- Columns: Inputs (or input categories)
- Entries: Next state/action
- Actions include:
- add: Add current character to lexeme
- get: Advance to next character in input
- none: Don't change lexeme and don't advance input
- Table for Example 4.1:

  State         | LETTER       | DIGIT        | UNKNOWN_CHAR
  --------------+--------------+--------------+-------------------
  START         | ID/add;get   | INT/add;get  | UNKNOWN_STATE/none
  ID            | ID/add;get   | ID/add;get   | DONE/none
  INT           | DONE/none    | INT/add;get  | DONE/none
  UNKNOWN_STATE | DONE/none    | DONE/none    | DONE/none
- Code for Example 4.1:

  function lex return token and lexeme
     curState := START
     curCharClass := getChar
     -- initialize token
     while curState /= DONE loop
        nextState := table(curState, curCharClass).nextState
        action    := table(curState, curCharClass).action
        -- perform action (add appends the current character to the lexeme;
        -- get advances the input and updates curCharClass)
        curState := nextState
     end loop
     -- look up lexeme if needed
     -- return token and lexeme
  end lex
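The pseudocode above can be made concrete. A runnable Python sketch: the entries follow the table for Example 4.1, with one deviation (noted in a comment) so that the scanner actually consumes an unknown character; all names are illustrative.

```python
START, ID, INT, UNKNOWN_STATE, DONE = range(5)
TOKEN_NAME = {ID: "ID", INT: "INT", UNKNOWN_STATE: "UNKNOWN", START: "EOF"}

def char_class(ch):
    if ch is None:
        return "EOF"
    if ch.isalpha():
        return "LETTER"
    if ch.isdigit():
        return "DIGIT"
    return "UNKNOWN_CHAR"

# TABLE[(state, input class)] = (next state, add?, get?)
TABLE = {
    (START, "LETTER"):       (ID, True, True),
    (START, "DIGIT"):        (INT, True, True),
    # add;get here (the notes' table says none) so the bad char is consumed
    (START, "UNKNOWN_CHAR"): (UNKNOWN_STATE, True, True),
    (ID, "LETTER"):          (ID, True, True),
    (ID, "DIGIT"):           (ID, True, True),
    (ID, "UNKNOWN_CHAR"):    (DONE, False, False),
    (INT, "LETTER"):         (DONE, False, False),
    (INT, "DIGIT"):          (INT, True, True),
    (INT, "UNKNOWN_CHAR"):   (DONE, False, False),
    (UNKNOWN_STATE, "LETTER"):       (DONE, False, False),
    (UNKNOWN_STATE, "DIGIT"):        (DONE, False, False),
    (UNKNOWN_STATE, "UNKNOWN_CHAR"): (DONE, False, False),
}

def lex(text, pos=0):
    """Return (token, lexeme, new_pos) for the next token in text."""
    state, lexeme = START, ""
    while True:
        ch = text[pos] if pos < len(text) else None
        cls = char_class(ch)
        if cls == "EOF":                 # end of input ends the token
            break
        next_state, add, get = TABLE[(state, cls)]
        if add:
            lexeme += ch                 # add: append current char to lexeme
        if get:
            pos += 1                     # get: advance to next input char
        if next_state == DONE:
            break
        state = next_state
    return (TOKEN_NAME[state], lexeme, pos)
```

Whitespace handling is omitted for brevity (a space classifies as UNKNOWN_CHAR here); a real scanner would skip it before entering the loop.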