ITEC 420 - Section 1.3 - Regular Expressions
Regular Expressions - Lecture Topics
- Regular Expressions: The Big Picture
- Arithmetic Expression: 2 + 3
- Regular Expression: (0 ∪ 1)0*
- Shorthand: 0 for {0}, concatenation
- Precedence: star, concatenation, union
- Formal definition
- Difference between ε and ∅
- R+ = RR*
- L(R) is the language described by RE R
- Examples
- Lexical analyzer of compiler
- Theorem 1.54: A language is regular iff some RE describes it.
Regular Expressions: The Big Picture
- RL = DFA = NFA (Do you know what this means?)
- Regular Operations operate on Regular Languages (ie take 1 or 2 RL and
create a new RL)
- We will use RO to define Regular Expressions
- Regular Expressions are a compact way to specify RL
Regular Operations: Review
- Arithmetic Operators:
- Combine numbers to create new number
- Example: 2 + 3 → 5
- Regular Operators:
- Combine regular languages to create new regular language
- Example: {0} ∪ {1} → {0,1}
- Operations: union, concatenation, star
- Example: {1,2} ο {3,4} → {13, 23, 14, 24}
- Example: {1,2}* → {ε, 1, 2, 11, 12, 21, 22, 111, ...}
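The three regular operations above can be tried out on small finite languages. The following is a minimal sketch, not part of the course material: the function names `union`, `concat`, and `star` are chosen here for illustration, and `star` takes a length bound (an assumption of this sketch) because A* is usually infinite.

```python
def union(a, b):
    """A ∪ B: strings that are in either language."""
    return a | b

def concat(a, b):
    """A ο B: every string of A followed by every string of B."""
    return {x + y for x in a for y in b}

def star(a, max_len):
    """The strings of A* with length <= max_len (A* itself is usually infinite)."""
    result = {""}       # ε is always in A*
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more string of A, keeping only new, short ones.
        frontier = {x + y for x in frontier for y in a
                    if len(x + y) <= max_len} - result
        result |= frontier
    return result

print(sorted(union({"0"}, {"1"})))
print(sorted(concat({"1", "2"}, {"3", "4"})))
print(sorted(star({"1", "2"}, 2), key=lambda s: (len(s), s)))
```

Python's empty string `""` plays the role of ε here.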
Regular Expressions: Motivation
- Regular languages are heavily used in programming languages
- Example: Identifiers: letter followed by 0 or more letters or digits
- Σ = {a, b, c, ..., y, z, 0, 1, ..., 9}
- FSM = ...
- Regular Expressions = Compact notation for specifying Regular Languages
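As an illustration of the identifier example, here is a small sketch using Python's `re` module. Python's pattern syntax is a practical dialect, not the formal notation of these notes: `[a-z]` abbreviates a ∪ b ∪ ... ∪ z. The name `is_identifier` is hypothetical.

```python
import re

# letter followed by 0 or more letters or digits, over Σ = {a, ..., z, 0, ..., 9}
IDENTIFIER = re.compile(r"[a-z][a-z0-9]*")

def is_identifier(s):
    """True if the whole string s is an identifier."""
    return IDENTIFIER.fullmatch(s) is not None

for s in ["x", "x1", "count2", "1x", ""]:
    print(repr(s), is_identifier(s))
```

`fullmatch` is used so that the entire string must be described by the expression, which matches the meaning of "w is in L(R)".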
Regular Expressions: Alphabet Symbols, ∪, and Star
- What regular language does this regular expression describe
- Example 1: 0 ∪ 1
- Answer 1: L(0 ∪ 1) = {0} ∪ {1}
- L(R) is the language of Regular Expression R
- L(0) = {0}
- For alphabet symbol s, we use s as a RE describing language {s}
- If Σ = {a, b}, then RE a describes {a}
- If Σ = {a, b}, then RE b describes {b}
- A ∪ combines two regular expressions to form a new RE
Regular Expressions: Star
- What regular language does this regular expression describe
- Example 1: 0*
- Answer 1: L(0*) = {0}*
- 0* describes {0}* (the set of all strings of character 0)
- If R is a RE, then R* is a new regular expression
- More examples:
- Example 2: (0* ∪ 1*)
- Answer 2: L(0* ∪ 1*) = {0}* ∪ {1}*
- Example 3: (0 ∪ 1)*
- Answer 3: L((0 ∪ 1)*) = {0,1}*
- Shorthand: R+ = RR*
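The difference between Examples 2 and 3 can be seen by listing short strings. This is a sketch that uses Python's `re` module as a stand-in for the formal notation (`|` is written for ∪). The helper `language` and its length bound are assumptions made for illustration.

```python
import re
from itertools import product

def language(pattern, alphabet="01", max_len=3):
    """Strings over `alphabet` of length <= max_len that `pattern` describes."""
    return {"".join(t)
            for n in range(max_len + 1)
            for t in product(alphabet, repeat=n)
            if re.fullmatch(pattern, "".join(t))}

by_length = lambda s: (len(s), s)
print(sorted(language(r"0*|1*"), key=by_length))    # Example 2: 0* ∪ 1*
print(sorted(language(r"(0|1)*"), key=by_length))   # Example 3: (0 ∪ 1)*
print(language(r"0+") == language(r"00*"))          # shorthand R+ = RR*, up to the bound
```

A string such as 01 mixes both symbols, so it is described by (0 ∪ 1)* but not by 0* ∪ 1*.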
Regular Expressions: Concatenation
- What language does this RE describe?
- Example 1: 0 ο 1
- Answer 1: L(0 ο 1) = {01}
- A ο combines two regular expressions to form a new RE
- Concatenation is frequently written without the ο
- More examples:
- Example 1 (revisited): L(01) = {01}
- Example 2: (0 ∪ 1)0 describes {00, 10}
- Example 3: (0 ∪ 1)(0*) describes ({0}+) ∪ ({1} ο ({0}*))
Regular Expressions: Precedence and Parentheses
- Precedence: star, concatenation, union (high to low)
- What languages are described by each of these RE?
- 0 ∪ 10*
- (0 ∪ 1)0*
- 0 ∪ (10*)
- Parentheses can be omitted if precedence is what you want
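Python's `re` module happens to use the same precedence (star, then concatenation, then `|` for union), so it can be used to experiment with the three expressions above. This is a sketch: the helper `accepted` and the sample strings are chosen here for illustration only.

```python
import re

SAMPLES = ["0", "1", "00", "10", "100", "010"]

def accepted(pattern, strings=SAMPLES):
    """The sample strings that the whole pattern describes."""
    return [s for s in strings if re.fullmatch(pattern, s)]

for pattern in ["0|10*", "(0|1)0*", "0|(10*)"]:
    print(pattern, accepted(pattern))
```

Comparing the three printed lists shows which parenthesization the unparenthesized expression agrees with.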
Regular Expressions: ε and ∅
- Regular Expression ε describes language {ε}
- Regular Expression ∅ describes language {}
Symbols, Strings, and Regular Expressions
- We use the same character (eg 0) to represent:
- a symbol (ie 0)
- a string (ie 0)
- a regular expression (ie 0)
- Normally clear from context which is meant
- Some books use bold fonts to distinguish
Σ*
- If Σ is an alphabet, then we also use Σ as a regular expression
- Σ = s1 ∪ s2 ∪ ... ∪ sn, where s1, ..., sn are the symbols of alphabet Σ
- As a result, we use the symbol Σ to represent any of the following:
- an alphabet (ie a set of symbols)
- a regular expression (the union of the REs of the alphabet symbols)
- the language that is the set of all strings described by the RE Σ
- Example: For alphabet Σ = {a, b} [set of symbols]
- Regular Expression Σ = a ∪ b [RE ∪ RE]
- Language Σ = L(a ∪ b) = {a, b} [set of strings]
Regular Expressions: Identities
- For any regular expression R:
- R ∪ ∅ = ?
- Rε = R ο ε = ?
- R ∪ ε = ???
- R∅ = R ο ∅ = ???
- R+ ∪ ε = ???
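One way to test a guess for the first four identities is to replace R by a small sample language and compare sets. This is a sketch under that assumption: a check on one finite language is evidence, not a proof, and the names `concat` and `check_identities` are hypothetical.

```python
def concat(a, b):
    """A ο B on finite languages."""
    return {x + y for x in a for y in b}

EMPTY = set()   # L(∅): no strings at all
EPS = {""}      # L(ε): just the empty string

def check_identities(lang):
    """For a finite language standing in for L(R), test four candidate identities."""
    return {
        "R ∪ ∅ = R": (lang | EMPTY) == lang,
        "Rε = R": concat(lang, EPS) == lang,
        "R ∪ ε = R": (lang | EPS) == lang,
        "R∅ = ∅": concat(lang, EMPTY) == EMPTY,
    }

for sample in [{"0", "01"}, {"", "1"}]:
    print(sorted(sample), check_identities(sample))
```

Trying a sample that contains ε and one that does not is useful, since a candidate identity may hold for some R and fail for others.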
Regular Expressions: Examples
- Example 1.53, page 65
- Tokens (p. 66)
Regular Expressions: Formal Definition
- Definition - R is a Regular Expression if R is
- a for some a in the alphabet Σ
- ε
- ∅
- (R1 ∪ R2), where R1 and R2 are regular expressions
- (R1 ο R2), where R1 and R2 are regular expressions
- (R1*), where R1 is a regular expression
- Definition - For RE R, L(R) [ie the language described by R] is
- L(a) = {a}, for a in alphabet Σ
- L(ε) = {ε}
- L(∅) = {} = ∅ (ie the language described by ∅ is
the empty language)
- L(R1 ∪ R2) = L(R1) ∪ L(R2)
- L(R1 ο R2) = L(R1) ο L(R2)
- L(R1*) = (L(R1))*
- This is a recursive/inductive definition
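Because the definition is recursive, it translates directly into a recursive data type and a recursive function. The sketch below is one possible encoding, not the textbook's: the class names and the function `lang` are invented here, symbols are assumed to be single characters, and `lang` returns only the strings of L(R) up to a length bound `n` so that the result is finite.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sym:      # clause 1: a, for some a in Σ
    a: str

@dataclass(frozen=True)
class Eps:      # clause 2: ε
    pass

@dataclass(frozen=True)
class Empty:    # clause 3: ∅
    pass

@dataclass(frozen=True)
class Union:    # clause 4: (R1 ∪ R2)
    r1: object
    r2: object

@dataclass(frozen=True)
class Concat:   # clause 5: (R1 ο R2)
    r1: object
    r2: object

@dataclass(frozen=True)
class Star:     # clause 6: (R1*)
    r1: object

def lang(r, n):
    """The strings of L(r) that have length <= n, one case per clause."""
    if isinstance(r, Sym):
        return {r.a} if n >= 1 else set()
    if isinstance(r, Eps):
        return {""}
    if isinstance(r, Empty):
        return set()
    if isinstance(r, Union):
        return lang(r.r1, n) | lang(r.r2, n)
    if isinstance(r, Concat):
        return {x + y for x in lang(r.r1, n) for y in lang(r.r2, n)
                if len(x + y) <= n}
    if isinstance(r, Star):
        base = lang(r.r1, n)
        result, frontier = {""}, {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= n} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {r!r}")

# (0 ∪ 1)0*
r = Concat(Union(Sym("0"), Sym("1")), Star(Sym("0")))
print(sorted(lang(r, 2)))
```

Each branch of `lang` mirrors one line of the definition of L(R), which is why the function needs no cases beyond the six clauses.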
Using the Formal Definition - Example
- What language is 1*(01*)*
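A quick experiment can suggest the answer before working through the definition by hand. The sketch below uses Python's `re` module as a stand-in for the formal notation and lists every string over {0, 1} up to a chosen length that the expression does not describe. The helper `all_strings` and the bound 8 are arbitrary choices, and a bounded check is evidence, not a proof.

```python
import re
from itertools import product

PATTERN = re.compile(r"1*(01*)*")

def all_strings(max_len, alphabet="01"):
    """Every string over the alphabet with length <= max_len, shortest first."""
    for n in range(max_len + 1):
        for t in product(alphabet, repeat=n):
            yield "".join(t)

rejected = [s for s in all_strings(8) if not PATTERN.fullmatch(s)]
print(rejected)
```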
Regular Expressions and Regular Languages
- Theorem 1.54: A language is regular iff some RE describes it
- Must prove both directions
- Lemma 1.55: If a language is described by a RE, then it is regular
- Lemma 1.60: If a language is regular, then it is described by a RE
- Which direction is easier?
Regular Expressions to NFA
- Lemma 1.55: If a language is described by a RE, then it is regular
- Proof idea: Build machines for RE based on clauses of definition:
- Any RE is one of the 6 kinds:
- Definition clauses 1 - 3: simple machines
- Definition clauses 4 - 6: combine machines as we did with regular operations
- Examples 1.56 and 1.58 (pp. 68-9)
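The proof idea can be sketched as a recursive function with one case per clause of the definition. This is a Thompson-style variant in which every fragment has a single start and a single accept state, so it differs in detail from the textbook's construction. The tuple encoding of REs and the names `build`, `eps_closure`, and `accepts` are assumptions made for this sketch; `None` marks an ε-transition.

```python
from itertools import count

def build(r, delta, fresh):
    """Add an NFA fragment for r to delta; return its (start, accept) states."""
    s, t = next(fresh), next(fresh)
    kind = r[0]
    if kind == "sym":                      # clause 1: one arrow labeled a
        delta.setdefault((s, r[1]), set()).add(t)
    elif kind == "eps":                    # clause 2: one ε arrow
        delta.setdefault((s, None), set()).add(t)
    elif kind == "empty":                  # clause 3: no way to reach t
        pass
    elif kind == "union":                  # clause 4: branch into either machine
        for sub in r[1:]:
            a, b = build(sub, delta, fresh)
            delta.setdefault((s, None), set()).add(a)
            delta.setdefault((b, None), set()).add(t)
    elif kind == "concat":                 # clause 5: run the machines in sequence
        a1, b1 = build(r[1], delta, fresh)
        a2, b2 = build(r[2], delta, fresh)
        delta.setdefault((s, None), set()).add(a1)
        delta.setdefault((b1, None), set()).add(a2)
        delta.setdefault((b2, None), set()).add(t)
    elif kind == "star":                   # clause 6: skip, or loop back and repeat
        a, b = build(r[1], delta, fresh)
        delta.setdefault((s, None), set()).update({a, t})
        delta.setdefault((b, None), set()).update({a, t})
    else:
        raise ValueError(kind)
    return s, t

def eps_closure(states, delta):
    """All states reachable from `states` using only ε arrows."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for p in delta.get((q, None), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def accepts(r, w):
    """Build the NFA for r and simulate it on the string w."""
    delta = {}
    start, accept = build(r, delta, count())
    current = eps_closure({start}, delta)
    for c in w:
        moved = set()
        for q in current:
            moved |= delta.get((q, c), set())
        current = eps_closure(moved, delta)
    return accept in current

# (0 ∪ 1)0*
r = ("concat", ("union", ("sym", "0"), ("sym", "1")), ("star", ("sym", "0")))
for w in ["", "1", "10", "100", "01"]:
    print(repr(w), accepts(r, w))
```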
NFA to Regular Expressions
- Lemma 1.60: If language L is regular, then L is described by a RE
- Proof idea:
- Convert the DFA into a GNFA (Generalized NFA)
- The GNFA will accept the language L
- Transform the GNFA to remove states until it contains only 2 states
- The transition between these states contains a RE that describes language L
GNFA
- Transitions use REs instead of just symbols
- The RE on the transition from A to B describes strings that move from A to B
- Many different strings could cause a transition between 2 states
- Example 1.61, p. 70
- Special Form:
- Start state: no arrows in
- Accept state: only 1 accept state, no arrows out
- Exactly one transition from every state to every other state, and from each state to itself
(except that no arrows enter the start state or leave the accept state)
- Intuition on transition:
- DFA: input x takes you from state A to state B
- Every possible direct transition from A to B describes a set of strings
- Think of the RE on the transition from A to B as representing
the strings that take you from A to B
- Notice that indirect A to B transitions describe different sets of strings
Convert NFA to GNFA
- It's easy to convert an NFA to a GNFA:
- Add a new start state
- Add a new accept state
- Fix arrows so that there is exactly one between (almost) all pairs of states:
- Start state: only out arrows
- Final state: only in arrows
- Add union arrows where there were multiple arrows
- Add ∅ arrows where there are no arrows (can be
omitted in diagram, for simplicity)
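The steps above can be sketched in a few lines. This is an illustrative encoding, not the textbook's: the function name `nfa_to_gnfa`, the new state names `"q_start"` and `"q_accept"`, and the use of plain strings such as `"0 ∪ 1"`, `"ε"`, and `"∅"` as labels are all assumptions of this sketch. The NFA's own states are assumed not to use the two new names.

```python
def nfa_to_gnfa(states, delta, start, accepts):
    """delta maps (state, symbol) -> set of states; the symbol "ε" marks an ε-move."""
    q_start, q_accept = "q_start", "q_accept"   # the two new states
    labels = {}                                  # (from, to) -> list of symbols
    for (q, a), targets in delta.items():
        for p in targets:
            labels.setdefault((q, p), []).append(a)
    labels[(q_start, start)] = ["ε"]             # new start state: ε arrow to old start
    for q in accepts:                            # old accept states: ε arrow to new accept
        labels.setdefault((q, q_accept), []).append("ε")
    gnfa = {}
    for qi in [q_start, *states]:                # no arrows leave q_accept
        for qj in [*states, q_accept]:           # no arrows enter q_start
            syms = labels.get((qi, qj))
            # Multiple arrows become one union arrow; a missing arrow becomes ∅.
            gnfa[(qi, qj)] = " ∪ ".join(sorted(syms)) if syms else "∅"
    return gnfa

# Example machine: accepts strings over {0, 1} that contain at least one 1.
delta = {("a", "0"): {"a"}, ("a", "1"): {"b"},
         ("b", "0"): {"b"}, ("b", "1"): {"b"}}
for pair, label in nfa_to_gnfa(["a", "b"], delta, "a", {"b"}).items():
    print(pair, label)
```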
GNFA: Formal Definition
- Definition - A GNFA is a 5-tuple (Q, Σ, δ, qstart, qaccept)
- Q is a finite set of states
- Σ is the input alphabet
- δ: (Q - {qaccept}) × (Q - {qstart}) → R is the transition function
(where R is the set of all REs over Σ)
- qstart is the start state
- qaccept is the accept state
- Computation with a GNFA:
- G accepts w if w = w1w2...wk, where each wi is a string in Σ*, and there is a sequence
of states s0, s1, ..., sk with s0 = qstart, sk = qaccept, and wi in L(δ(si-1, si)) for each i
GNFA: Removing States
- Basic process: Remove a state and modify RE on affected
transition
- Figure 1.63, p. 72
- Choose any state (except the start or accept state) to remove (ie qrip)
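- For all pairs (qi, qj) such that there is a transition R1 from qi to qrip
and a transition R3 from qrip to qj, modify (or create if needed)
the transition from qi to qj
- New label: (R1)(R2)*(R3) ∪ (R4), where R2 is the RE on the self-loop at qrip
and R4 is the old RE from qi to qj (as in Figure 1.63)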
- qi and qj can be the same state
- Figure 1.67, pg. 75
- Figure 1.69, pg. 76
- Procedure Convert(G)
- Convert: GNFA → RE
- Convert is called recursively on a machine with one
fewer state
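The Convert procedure can be sketched as a short recursive function. The encoding is an assumption of this sketch, not the textbook's: labels are written in Python `re` syntax so the result can be tested directly, `None` stands for ∅, the empty pattern `""` stands for ε, symbols are assumed to be regex-safe characters such as 0 and 1, and the helper names are invented. When a state is ripped out, each remaining arrow gets the label (R1)(R2)*(R3) ∪ (R4) as in Figure 1.63.

```python
import re

def union(r, s):
    """R ∪ S, where None is ∅."""
    if r is None:
        return s
    if s is None:
        return r
    return f"(?:{r}|{s})"

def concat(*parts):
    """Concatenation; anything concatenated with ∅ is ∅."""
    if any(p is None for p in parts):
        return None
    return "".join(parts)

def star(r):
    """R*; both ∅* and ε* describe just ε."""
    return "" if r in (None, "") else f"(?:{r})*"

def convert(gnfa, states, q_start="q_start", q_accept="q_accept"):
    """gnfa maps (qi, qj) -> label; states lists the states other than start/accept."""
    if not states:                       # only 2 states left: return the single label
        return gnfa.get((q_start, q_accept))
    q_rip, rest = states[0], states[1:]
    r2 = gnfa.get((q_rip, q_rip))
    smaller = {}
    for qi in [q_start, *rest]:
        for qj in [*rest, q_accept]:
            r1 = gnfa.get((qi, q_rip))
            r3 = gnfa.get((q_rip, qj))
            r4 = gnfa.get((qi, qj))
            smaller[(qi, qj)] = union(concat(r1, star(r2), r3), r4)
    return convert(smaller, rest, q_start, q_accept)   # one fewer state

# GNFA for "contains at least one 1"; missing pairs are ∅.
gnfa = {("q_start", "a"): "", ("a", "a"): "0", ("a", "b"): "1",
        ("b", "b"): "(?:0|1)", ("b", "q_accept"): ""}
regex = convert(gnfa, ["a", "b"])
print(regex)
print([s for s in ["", "0", "1", "001", "000"] if re.fullmatch(regex, s)])
```

The resulting expression is correct but not simplified. Different choices of qrip generally give different, equivalent expressions.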
Proof of Claim 1.65
- Claim 1.65: Convert(G) is equivalent to G (ie
L(Convert(G)) = L(G))
- Inductive proof on number of states in G
- Basis: k = 2 states:
G has only a start state and an accept state, joined by a single transition.
The RE on that transition describes all
strings that take G to the accept state,
and Convert(G) returns that RE,
thus L(Convert(G)) = L(G)
- Induction step:
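- Assume the claim is true for GNFAs with
k-1 states, and let G have k > 2 states
- Let G' be the GNFA with k-1 states obtained from G by removing qrip
- First we show that L(G) = L(G')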
- Suppose G accepts input w.
Then G' accepts w whether or not the
accepting computation passes through qrip: if it
does not, each transition of G' contains
the corresponding transition of G in a union;
if it does,
then the states bracketing each run of qrip have a new RE in G'
that describes the strings taking G between them via qrip.
Thus, L(G) ⊆ L(G')
- Now suppose G' accepts input w.
Any RE on a transition in the
computation describes strings that take G
between the same two states, either directly or via qrip, and so
G must
also accept w. Thus L(G') ⊆
L(G)
- Thus, L(G) = L(G') and hence
G and G' are equivalent
- By the inductive assumption,
L(G') = L(Convert(G')).
Finally, when G has more than 2 states,
Convert(G) returns Convert(G') and so
L(Convert(G)) = L(Convert(G')) = L(G') = L(G).
Conclusion
- Thus we have proved the following:
- Claim 1.65: L(G) = L(Convert(G))
- Lemma 1.60: If a language is regular,
then it is described by a RE
- Lemma 1.55: If a language is described by a RE, then it is regular
- Theorem 1.54: A language is regular iff some RE describes it