ITEC 460 - Chapter 2 Notes
Introduction
- Lexical Analyzer = Lexer = Scanner
- Each call returns a token and a value (ie lexeme) [Figure and Table 2.1]
- Skip white space and comments
Basic Techniques
- Brute Force
- Finite state machine (FSM) = Deterministic Finite Automata (DFA)
- Table driven
- Scanner Generator: Regular Expressions → Scanner
- Regular Expression = textual representation of a set of strings
Brute Force
- Lots of nested ifs
- Figure 2.2
- Not feasible for non-toy languages
- Need some theory: finite state machines, regular expressions
Buzz Words: Alphabet, Strings, Empty String, Language
- Alphabet - Σ : set of symbols
- String: sequence of symbols from Σ - written without commas
- String Length: number of symbols in the string
- Empty string: string of length 0
- How to represent?
- Epsilon: ε
- Language: Set of strings
Our Goal
- At this point, our goal is to
- Develop a formal mechanism for defining the valid words in a language
- Associated Subgoal: technique for recognizing them
- Tools: Finite State Machines, Regular Expressions
Finite State Machine
- Machine:
- Set of Symbols - Σ - Set of legal symbols
- Set of States: Q
- Designated Start State: q0
- Set of Final States: F
- Transition Function: δ : Q × Σ → Q
- Transition for each input symbol on each state not required
Finite State Machine: Representation
- Circles = States
- Arrows = Transitions
- Start state, input symbol, end state
- Possibly add actions
- Initial state has arrow in
- Each final state has a double circle
- Example Figures 2.7 and 2.8
Finite State Machine: Computation
- Process on Input: accept or reject an input string (ie is it a valid string?)
- Computation starts in Start State
- Each input symbol moves computation to next state
-
If no transition defined, reject the string
- At end of input, if computation is in an accept state, accept the string
- Otherwise reject the string
- Examples: Figs 2.9, 2.10, 2.12
- Deterministic: transitions have only one choice for each input and
they must consume input
- FYI, Nondeterministic FA can have a choice
on an input and can make a transition without consuming input
- Also, in 420 we prove that NFA are equivalent to DFA
Finite State Machine: Table Driven
- Left: current state
- Top: input symbol
- Entries: next state
- Examples: Parsons p24, p25
Where does the Table Come From?
- RE → NFA → DFA → Table
- RE = Regular Expressions
- NFA = Nondeterministic Finite Automata
- DFA = Deterministic Finite Automata
- Or use JavaCC, or SableCC, or Lex, or Flex, or JFlex, or ...
- We will first look at RE and JavaCC, then come back and look at RE → NFA → DFA
Regular Expressions (RE)
- Remember: Goal is to specify words in a language
- Textual tool for specifying a language for
- To define RE, we first define operations on languages: language
concatenation, Kleene closure
String Concatenation
- To define language concatenation, we first define string concatenation
- Concatenation of two strings: join the two strings
- Notation: x ο y
- Example: fire ο truck = firetruck
- Empty string: For all strings x, x ο ε =
x
Language Concatenation
- Concatenation of two LANGUAGES: join each pair of strings
- Notation: L1 ο L2
- Example:
- L1 = { ab, cd}
- L2 = { de, fg}
- L1 ο L2 = {abde, abfg, cdde, cdfg}
- L1 ο L1 = {abab, abcd, cdab, cdcd}
- Definition: Li = L ο Li - 1
Kleene Closure
- Definition: L* = {ε ∪ L ∪ L2 ∪
... }
- Named for mathenatician Kleene
- Examples: L1*, L2*
Regular Expressions: Definition
- Remember we use a RE to describe a language
- Remember that a language is a set of strings
- Recursive Definition:
- Each symbol s → { s}
- ε → {ε}
- Concatenation: For any two RE α and β, α ο
β is a RE
- Choice: For any two RE α and β, (α | β) is a RE
- Closure: For any RE α, α * is a RE
- Examples: Table 2.2
RE Shorthand
- RE are common in editors, scripting languages, JavaCC, Lex, ..., but notations differ slightly
- Drop ο
- Drop unnecessary parens (Precedence: Closure high, union low)
- Use [] for one of a set, eg [abcd], and - for ranges
- JavaCC: ["a", "b", "c", "d"] or ["a"-"d"]
- JavaCC: ["a"-"z", "0"-"9"]
- Use [^] for one NOT in a set [^abcd]
- JavaCC - different symbol and location: ~["a", "b", "c", "d"] or ~["a"-"d"]
- ? means optional, ie a? means (a|ε)
- + means one or more, ie for RE a, a+ means
(a ο a*)
- Strings in quotes mean those strings, eg "|" means vertical bar, not
choice
- White space is ognored (except within quotes)
RE Examples
- Examples: Table 2.3
- Examples: (a|b)*, a[bc]*d
- Examples of Java tokens: Table 2.4
JavaCC
- A Simple Example: simple.jj
- And a client to test it: TestJCC.java
- Remember: First token to match is used
- Remember: Longest token to match is used
Running JavaCC
- We need to run javacc on
simple.jj
- Run with:
java -classpath javacc.jar javacc simple.jj
- Download javacc from here
- Assumes that javacc.jar is in current directory
- You can also add javacc.jar to your system classpath
and run with
java javacc simple.jj
- New: Javacc is now installed on rucs and it can be
run with
javacc simple.jj
If you want to run javacc directly, you can set the
classpath environment variable as shown in the
javacc script at
/usr/local/gnu/bin/javacc
Files Produced by JavaCC
Now Use TestJCC to Test the Lexer
- Compile TestJCC - this causes compilation of the javacc output
javac TestJCC.java
This causes compilation of the javacc output
Now run:
java TestJCC somefile
Assume that somefile
contains some text to tokenize
Let's Experiment with simple.jj
- Try moving IDENTIFIER
- Try putting the left paren in the wrong place in IDENTIFIER
- Try without SKIP
- Pay attention to file dependencies
Skipping Comments
SKIP:
{
<"--" (~["\n"])* "\n">
}
Skipping Java bracketed comments (non-nesting)
- A RE is possible, but complex. An extra state is easier
- Switch to state IN_COMMENT when see "/*"
- Switch back to state DEFAULT when see "*/"
- Figure 2.16
- Handling nested comments
Skipping SimpleJava comments (bracketed and nesting)
- More complex - see assignment
Actions - Counting Comments
- SKIP and TOKEN rules can have associated actions
- The actions will be taken after the rule is met
- Actions can use declarations in the TOKEN_MGR_DECLS block
- Example - Count comments
- RemoveComments.java
More Capabilities of Tokens
-
public Token next
- the caller of
getNextToken() can use this to create a linked list of tokens
-
public Token specialToken
- linked list of
all special tokens that appear immediately before the current
token. See below.
- MORE Rules - Skipped and then prepended to current token
image. See example pages 30 and 31.
Special Tokens
- Not returned by getNextToken(), but saved
- See example page 29
Project
- Complete the project given in Section 2.5 on page 30 to create a lexer
for SimpleJava.