ITEC 420  Section 1.3  Regular Expressions
Regular Expressions  Lecture Topics
 Review:
 Review: Our Objects so Far
 Review: Types
 Closure of Regular Languages
 Specifying a Language
 Regular Expressions: Defined
 Regular Expressions: Introduction
 Regular Expressions: Formal Definition
 Regular Expressions: Shorthand and Precedence
 Regular Expressions: Examples
 Regular Expressions: Identities
 Regular Expressions: Example Applications
 Regular Example and Regular Languages
 GNFA  Introduction
 GNFA  Specifying Form and Computation
 GNFA  Formal Definition
 GNFA  Removing States
 GNFA  Procedure Convert
 Proof of Claim 1.65
 Conclusion: RE ↔ RL
Review: Our Objects so Far
 Symbol, Alphabet, String
 Language
 Finite Automata:
 M = (Q, Σ, δ, q0, F)
 δ: Q × Σ → Q
 L(M) connects Langauges and Machines
 This operation connects Languages and Machines
 Regular Language
 NFA:
 δ: Q × Σ_{ε} → P(Q)
 Σ_{ε} = Σ ∪ { ε }
 DFA ↔ NFA ↔ Regular Language
Review: Types
 Type:
 A Type is a Set of Values and Set of Operations
 Types are Not in Book, but they are useful to think about
 Strings (over an alphabet):
 Values: all finite sequences of symbols of any length, including 0
 Operations:
 Length (eg abc = 3; if w = defg then w = 4
 Concatentation (eg xy for strings x and y )
 x^{k} (self concatenation; for k ≥ 0)
x^{0}
 Languages (over an alphabet):
 Values: Any set of strings
 Operations: Regular operations: ∪, o, *
Closure of Regular Languages
 Careful: Regular Operations are defined over ALL languages
 Why called regular operations?
 RL are connected to regular expresions (which, as we will see, define RL)
 Regular Languages are closed under Regular Operations
 Theorems: If A and B are regular languages, then
 A ∪ B is a regular language
 A ⚬ B is a regular language
 A* is a regular language
 Proved w/ NFA
How to Specify a Language
 List the elements: (eg {0, 00, 101})
 Describe in words  Examples:
 Language A is all strings over {a,b} with 2 a's
 B = { w  w starts and ends with 0}
 C = { w  w has the same number of 0's and 1's and
w < 5}
 D = { w  w has the same number of 0's and 1's }
 Define a Machine (DFA or NFA)
 Form from simpler languages using Regular Operations  Examples:
 New technique: Specify using a Regular Expression
Regular Expressions: Components
 Regular expressions are made up of the following components:
 Alphabet Symbols
 Union
 Star
 Concatenation
 ε
 φ
 We will look at each component in turn
Regular Expressions: Alphabet Symbols
 Regular Expressions can be the symbols of the alphabet
 Example 1: For alphabet Σ = {0, 1}
 Symbols 0 and 1 are Regular Expressions
 0 represents the language {0}
 1 represents the language {1}
 Example 2: What are the Regular Expressions for alphabet Σ = {a, b, c}
 Summary: For any alphabet symbol s, we use s as a RE describing language {s}
Language Described by a Regular Expressions
 We use L(R) to represent the language described by Regular Expression R
 Examples:
Regular Expressions: ∪
 Larger languages can be described using RE formed with ∪
 What regular language does regular expression 0 ∪ 1 describe
 Answer: L(0 ∪ 1) = {0} ∪ {1} = {0, 1}
 L(R1 ∪ R2) = L(R1) ∪ L(R2)
 Another example:
Regular Expressions: Star
 What regular language does this regular expression 0* describe
 L(R*) = (L(R))*
 More examples:
 L(0* ∪ 1*) = {0}* ∪ {1}*
 L((0 ∪ 1)*) = {0,1}*
Regular Expressions: Concatenation
 What language does 0 ⚬ 1 describe?
 L(R1 ⚬ R2) = L(R1) ⚬ L(R2)
 More examples:
 Example 1 (revisited): L(01) = {01}
 Example 2: L((0 ∪ 1)0) = L(0 ∪ 1) ⚬ L(0) = {0,1}
⚬ {0} = {00, 10}
 Example 3: (0 ∪ 1)(0^{*})
= L(0 ∪ 1) ⚬ ({0}*))
= {0 ∪ 1} ⚬ ({0}*)
= {0} ⚬ ({0}*) ∪ {1} ⚬ ({0}*)
= ...
 Concatenation is frequently written without the ⚬
Regular Expressions: Precedence and Parentheses
 Precedence: star, concatenation, union (high to low)
 What languages are described by each of these RE?
 0 ∪ 10^{*}
 (0 ∪ 1)0^{*}
 0 ∪ (10^{*})
 Parentheses can be omitted if precedence gives desired meaning
Regular Expressions: R^{+}
 Shorthand: R^{+} = RR^{*}
 Example: (0 ∪ 1)(0^{*})
= L(0 ∪ 1) ⚬ ({0}*))
= {0 ∪ 1} ⚬ ({0}*)
= {0} ⚬ ({0}*) ∪ {1} ⚬ ({0}*)
= 0^{+} ∪ 10*
Regular Expressions: ε and ∅
 Regular Expression ε describes language {ε}
 Regular Expression ∅ describes language {}
Symbols, Strings, and Regular Expressions
 We use the same character (eg 0) to represent:
 a symbol (ie 0)
 a string (ie 0)
 a regular expression (ie 0)
 Normally clear from context which is meant
 Some books use bold fonts to distinguish
 ε is either the empty string or a RE representing the language
containing only the empty string
 φ is either the empty set or a RE representing the empty language
 Which, of course, is simply the empty set
Σ^{*}
 If Σ is an alphabet, then we also use Σ as a regular expression
 Σ = s_{1} ∪ s_{2} ∪ ... ∪
s_{n}, where s_{1}, ..., s_{n} are the symbols of alphabet Σ
 As a result, we use the symbol Σ to represent either of the following:
 an alphabet (ie a set of symbols)
 A regular expression that is the union of the REs of the alphabet symbols
 Example: For alphabet Σ = {a, b} [set of symbols]
 Regular Expression Σ = a ∪ b [RE ∪ RE]
 For Alphabet Σ, we let Σ* be a RE that represents all strings over the alphabet Σ
Regular Expressions: Identities
 For any regular expression R:
 ε* = ?
 φ* = ?
 R ∪ ∅ = R
 R ∪ ε = ???
 Rε = R ⚬ ε = ?
 R∅ = R ⚬ ∅ = ???
 R^{+} ∪ ε = ???
Regular Expressions: Formal Defintion
 R is a Regular Expression if R is
 a for some a in the alphabet Σ
 ε
 ∅
 (R1 ∪ R2), where R1 and R2 are regular expressions
 (R1 ⚬ R2), where R1 and R2 are regular expressions
 (R1^{*}), where R1 is a regular expression
 Definition  For RE R, L(R) [ie the language described by R] is
 L(a) = {a}, for a in alphabet Σ
 L(ε) = {ε}
 L(∅}) = {} = ∅ (ie language described by ∅ is
the empty language)
 L(R1 ∪ R2) = L(R1) ∪ L(R2)
 L(R1 ⚬ R2)= L(R1) ⚬ L(R2)
 L(R1*) = (L(R1))*
 This is a recursive/inductive definition
Regular Expressions: More Examples
 What language is 1*(01*)*
 Example 1.53, page 65
 1*φ*
 0 ⚬ (0 ∪ 1)* ⚬ 0
 (a ∪ b) ⚬ a ⚬ (a ∪ b)* a ⚬ (a ∪ b)*
 Tokens (p. 66)
Regular Expressions: Example Application
 Tokens (p. 66)
 Regular languages and Regular Expressions are heavily used in
Programming
Language Theory
 In a language, the set of all possible tokens is a RL
 A token is essentially a word of the language (eg numbers,
identifiers, keywords, operators)
 Example: Numbers: digit followed by 0 or more digits (length?)
 Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
 FSM = ...
 Example: Identifiers: letter followed by 0 or more letters or digits
 Σ = {a, b, c, ..., y, z, 0, 1, ..., 9}
 FSM = ...
Regular Expressions as a Type
 Values: All possible regular expressions (over some Σ)
 Operations: ∪, o, *
 More Operations: ^{+}, L(R)
Regular Expressions Describe Regular Languages
 We prove that Regular Expressions describe the Regular
Languages, and only the RL
 Theorem 1.54: A language is regular iff some RE describes it
 The proof has two parts:
 Lemma 1.55: Every RE describes a RL
 Lemma 1.56: Every RL can be described by a RE
 Together, Lemmas 1.55 and 1.60 imply Theorem 1.54
Part 1: Languages Described by RE Are Regular
 Lemma 1.55: If a language is described by a RE, then it is regular
 Proof idea  Build a machine that accepts the language
of a given RE:
 Definition clauses 1  3: construct simple machines
 Definition clauses 4  6: use same approach as
when proving closure of regular operations
 Examples 1.56 and 1.58 (pp. 689)
 For clauses 46, proof uses induction on the length
of the RE
 Base cases: ???
 IH: Assume a RE of length k (for an
arbitrary k≥1) defines a regular language
 Now consider any RE of length k+1. It must be
one of three forms: ...
Part 2: Each Regular Language Can be Described by a RE
 Lemma 1.60: If a language is regular, then it is described by a RE
 Basic technique: Create a new kind of machine, a GNFA (Generalized NFA) that has the
same power as a NFA and that is suited to proving Lemma 1.60
 Proof idea  For each RL, convert its NFA into an equivalent RE:
 Regular Language L has an NFA, call it N
 Convert N into a GNFA (Generalized NFA) called G that accepts L
 Transform this G to find a RE that describes L:
while G has 2 or more states loop
remove from G one state and all its transitions
modify other transitions so still G accepts L
end loop
 G now has 2 states and 1 transition`
G's single transition has a RE that describes
N's language L
Part 2: GNFA Transitions
 GNFA Transitions have RE that
describe strings that take machine between states
 Language defined by machine is set of strings
described by transitions on path from start to final
states
 Example: transition between states 1 and 2 has
RE 01*0
 Strings 00, 010, 0110, ... take machine from states 1 to 2

Because the RE on the transition from 1 to 2 describes
these strings
Part 2: Simple Examples Modifying GNFA
 Example 1:
transition from states 1 to 2 has a
transition from states 2 to 3 has c.
Now remove state 2.
 Example 2, like Example 1, but with a loop labeled b on 2
 Actual modifications are more complicated because ALL
PATHS must be considered
 Example 3: Like Example 2, but add transition from
states 1 to 3 labeled with d. Now remove state 2
GNFA  Compared to NFA
 GNFA are based on NFA, but
 Transitions are labeled with REs rather than symbols
 Following a transition consumes a string, not just
a single symbol
 (Almost) every pair of states has a transition
 δ specifies the RE on transition between states
 GNFA G accepts n string w if a path
from initial to final states
contains RE that describe substrings of w (in order, of
course)
 Example 1.61, p. 70
GNFA  Special Form and Computation
 A GNFA has a special form:
 Start state: no arrows in
 Accept state: only 1 accept state, no arrows out
 All other pairs of states have arrows
 Delta (ie δ) returns the RE between two
states
 Delta does not map a state and an input symbol to a state
 Delta maps a pair of states to a Regular Expression
 δ: (Q{q_{accept}}) ×
(Q{q_{start}}) → R
 A computation moves between two states if the RE
between those states describes
the next substring of the input string
 A transition between two states consumes an input
(sub)string described by the RE on that transition
 If multiple paths between two states, then different
paths between these states will probably consume different
input strings
How to Convert an NFA to an Equivalent GNFA
 It's easy to convert an NFA N to an equivalent GNFA G:
 Add a new start state
 Add a new accept state
 Single arrows between states are left unchanged
 Symbols and ε on arrows become symbol
and ε REs
 Multiple arrows between a pair of states
are removed and replaced by a Union arrow
 Add ∅ arrows to G so that there are arrows
between all pairs of states (except to the start
state and from the final state)
 Example
GNFA: Formal Definition

A GNFA is a 5tuple (Q, Σ, δ, q_{start},
q_{accept})
 Q is a finite set of states
 Σ is the input alphabet
 δ: (Q{q_{accept}}) ×
{Q{q_{start}}) → R
is the set of transitions
(where R is the set of all possible RE over Σ)
 q_{start} is the start state
 q_{accept} is the accept state
 Computation with a GNFA:
 G accepts w =
w_{1}w_{2}...w_{k} for
strings w_{1} ... w_{k} if
 there is a computation that goes through states s_{0}, s_{1}, ..., s_{k}

and w_{i} is in L(δ(s_{i1}, s_{i}))
GNFA: Removing States
 Basic process:
 Remove a state and modify transitions so that language accepted does not change
 Repeat 1 until only 2 states remain
 Algorithm: (See Figure 1.63, p. 72)
 Choose any state (except the start or accept state) as the state to remove.
Call this state q_{rip}.
 For all pairs of states
(q_{i}, q_{j})
for which there is a pair of nonempty transitions
R_{1} and R_{3}, with
R_{1} ∈ δ(q_{i}, q_{rip}) and R_{3} ∈
δ(q_{rip},
q_{j}):
 modify delta to change the transition from q_{i} to q_{j}
 That is, modify δ(q_{i}, _{j})
Removing States: Notes on Modifying Delta
 For strings that go through qrip to get from qi to qj,
the modifications take those same strings directly from qi to qj
 The modifications do nothing to modify the set of
strings that go from
q_{i} to q_{j}
without going through q_{rip}
 Thus, removing q_{rip} does not change the language accepted by the machine
 The direct transition from q_{i}
to q_{j} may be φ
before removing q_{rip}
 qi and qj can be the same state
Example 1: Figure 1.67, pg. 75
Example 2: Figure 1.69, pg. 76
GNFA: Procedure Convert
 Procedure Convert converts a GNFA into an equivalent Regular Expression (ie one that
describes the language that the GNFA accepts)
 Convert: GNFA → RE
 Procedure Convert(G) formally defines the process of
 removing a state and
 modifying the resulting machine so that it it recognizes the same language
 For GNFA G, Convert(G) recursively calls Convert(G'), where GNFA
G' is G with one state removed
Proof of Claim 1.65
 Claim 1.65: For any GNFA G, L(Convert(G)) = L(G)
 That is, the language described by the RE returned by
Convert(G) is the same as the language recognized by the machine G
 We use an inductive proof on k, number of states in G
 Basis: k = 2 states:
When k=2, G has one transition.
The RE on that transition describes all
strings that take G from the initial state to the accept state.
These strings make up L(G). Furthermore, Convert(G) returns
the single RE on the single transtion of G.
Thus, L(Convert(G)) is the set of strings described by the RE on G's transition.
Thus L(Convert(G)) is equivalent to L(G) (ie the strings
that take Convert(G) to the accept state are the strings
generated by the RE on the single transition of G).
 Induction step:
 Inductive Hypothesis: Assume Claim 1.65 is true for any machine
that has k1 states, for an arbitrary k > 2.
 That is, if Machine H has k1 states,
then L(Convert(H))=L(H).
 In words, if Machine H has k1 states, then the RE produced by
Convert(H) describes the same language as machine H recognizes
 Now consider any machine G that has k states.
 First we show that L(G) = L(G'), where G' is the
modified machine. [That is, we show that both machines
recognize the same language]
 We start by showing that if G accepts string w, then
G' must also accept w (ie
that L(G) ⊆ L(G').
Now, if G accepts w, then there are two possibilities:
either G's computation on w does not contains q_{rip},
or it does contain q_{rip}:
 If the computation on G
does not contain q_{rip}, then
each transition on G in this computation can also be found on G',
possibly as part of a union, and thus G' has a computation that accepts w.
 If the computation on G does contain q_{rip},
then for any run of q_{rip}s in the computation on G, the
two states that bracket this run have a new RE in G' and this new RE
takes strings between the two states in G).
Thus, L(G) ⊆ L(G') (ie if G accepts w, then G' accepts w)
 Now we show that if G' accetps a string w,
then G also accepts w (ie
that L(G') ⊆ L(G).
Suppose G' accepts input w.
Then G accepts w because at any step in the
computation in G', the RE on the transition between the pair of
states in G'is constructed to describe the strings that go
between the same two states in G. Thus,
G must
also accept w. Therefore L(G') ⊆ L(G)
 Thus, from 3.1 and 3.2 we conclude that
L(G) = L(G'), that is
G and G' accept the same set of strings.
 Now, by the construction of G', if G has k states,
then G' has k1 states, and by the Inductive Hypothesis
we know that L(G') = L(Convert(G')).

Finally, when Convert is written so that if G has more than 2 states,
then Convert(G) returns Convert(G'). That is, by
definition, Convert(G) = Convert(G') which means that
L(Convert(G')) = L(Convert(G)).

Thus, from the three points above we have
 L(G)=L(G')
 L(G') = L(Convert(G'))
 L(Convert(G')) = L(Convert(G))
Finally, we can conclude that
L(G) = L(Convert(G)) , our desired result.
 We have shown that Claim 1.65 is true when G has 2 states. We have also shown
that when we
assume it is true when G has k1 states,
then we can show it is also true when G has
k states. Thus, Claim 1.65
is true, by induction, for all GNFA that have 2 or more states. This is what we wanted to
prove!.
Conclusion: RE ↔ RL
 Remember: Convert: GNFA → RE
 We have now proved the following:
 Claim 1.65: L(G) = L(Convert(G)) for GNFA G
 Lemma 1.60: If a language is regular,
then it is described by a RE
 This follows from Claim 1.65
 What is the chain of reasoning?
 Lemma 1.55: If a language is described by a RE, then it is regular
 Thus we conclude:
 Theorem 1.54: A language is regular iff some RE describes it
ITEC 420 Course Page,
Last modified on