# ITEC 420 - Section 1.3 - Regular Expressions

## Regular Expressions - Lecture Topics

• Review:
• Review: Our Objects so Far
• Review: Types
• Closure of Regular Languages
• Specifying a Language

• Regular Expressions: Defined
• Regular Expressions: Introduction
• Regular Expressions: Formal Definition
• Regular Expressions: Shorthand and Precedence
• Regular Expressions: Examples
• Regular Expressions: Identities
• Regular Expressions: Example Applications

• Regular Example and Regular Languages
• GNFA - Introduction
• GNFA - Specifying Form and Computation
• GNFA - Formal Definition
• GNFA - Removing States
• GNFA - Procedure Convert
• Proof of Claim 1.65
• Conclusion: RE ↔ RL

## Review: Our Objects so Far

• Symbol, Alphabet, String

• Language

• Finite Automata:
• M = (Q, Σ, δ, q0, F)
• δ: Q × Σ → Q
• L(M) connects Langauges and Machines
• This operation connects Languages and Machines

• Regular Language

• NFA:
• δ: Q × ΣεP(Q)
• Σε = Σ ∪ { ε }

• DFA ↔ NFA ↔ Regular Language

## Review: Types

• Type:
• A Type is a Set of Values and Set of Operations
• Types are Not in Book, but they are useful to think about

• Strings (over an alphabet):
• Values: all finite sequences of symbols of any length, including 0
• Operations:
• Length (eg |abc| = 3; if w = defg then |w| = 4
• Concatentation (eg xy for strings x and y )
• xk (self concatenation; for k ≥ 0) x0

• Languages (over an alphabet):
• Values: Any set of strings
• Operations: Regular operations: ∪, o, *

## Closure of Regular Languages

• Careful: Regular Operations are defined over ALL languages
• Why called regular operations?
• RL are connected to regular expresions (which, as we will see, define RL)
• Regular Languages are closed under Regular Operations

• Theorems: If A and B are regular languages, then
• A B is a regular language
• A B is a regular language
• A* is a regular language

• Proved w/ NFA

## How to Specify a Language

• List the elements: (eg {0, 00, 101})

• Describe in words - Examples:
• Language A is all strings over {a,b} with 2 a's
• B = { w | w starts and ends with 0}
• C = { w | w has the same number of 0's and 1's and |w| < 5}
• D = { w | w has the same number of 0's and 1's }

• Define a Machine (DFA or NFA)

• Form from simpler languages using Regular Operations - Examples:
• A B
• A B
• C*

• New technique: Specify using a Regular Expression

## Regular Expressions: Components

• Regular expressions are made up of the following components:
• Alphabet Symbols
• Union
• Star
• Concatenation
• ε
• φ

• We will look at each component in turn

## Regular Expressions: Alphabet Symbols

• Regular Expressions can be the symbols of the alphabet

• Example 1: For alphabet Σ = {0, 1}
• Symbols 0 and 1 are Regular Expressions
• 0 represents the language {0}
• 1 represents the language {1}

• Example 2: What are the Regular Expressions for alphabet Σ = {a, b, c}

• Summary: For any alphabet symbol s, we use s as a RE describing language {s}

## Language Described by a Regular Expressions

• We use L(R) to represent the language described by Regular Expression R

• Examples:
• L(0) = {0}
• L(a) = {a}

## Regular Expressions: ∪

• Larger languages can be described using RE formed with ∪

• What regular language does regular expression 0 ∪ 1 describe
• Answer: L(0 ∪ 1) = {0} ∪ {1} = {0, 1}

• L(R1 ∪ R2) = L(R1) ∪ L(R2)

• Another example:
• a ∪ b ∪ c = {a, b, c}

## Regular Expressions: Star

• What regular language does this regular expression 0* describe

• L(R*) = (L(R))*

• More examples:
• L(0* ∪ 1*) = {0}* ∪ {1}*

• L((0 ∪ 1)*) = {0,1}*

## Regular Expressions: Concatenation

• What language does 0 ⚬ 1 describe?
• Answer: L(0 ⚬ 1) = {01}

• L(R1 ⚬ R2) = L(R1) ⚬ L(R2)

• More examples:
• Example 1 (revisited): L(01) = {01}

• Example 2: L((0 ∪ 1)0) = L(0 ∪ 1) ⚬ L(0) = {0,1} ⚬ {0} = {00, 10}

• Example 3: (0 ∪ 1)(0*) = L(0 ∪ 1) ⚬ ({0}*)) = {0 ∪ 1} ⚬ ({0}*) = {0} ⚬ ({0}*) ∪ {1} ⚬ ({0}*) = ...

• Concatenation is frequently written without the ⚬

## Regular Expressions: Precedence and Parentheses

• Precedence: star, concatenation, union (high to low)

• What languages are described by each of these RE?
• 0 ∪ 10*
• (0 ∪ 1)0*
• 0 ∪ (10*)

• Parentheses can be omitted if precedence gives desired meaning

## Regular Expressions: R+

• Shorthand: R+ = RR*
• Example: (0 ∪ 1)(0*) = L(0 ∪ 1) ⚬ ({0}*)) = {0 ∪ 1} ⚬ ({0}*) = {0} ⚬ ({0}*) ∪ {1} ⚬ ({0}*) = 0+ ∪ 10*

## Regular Expressions: ε and ∅

• Regular Expression ε describes language {ε}
• Regular Expression ∅ describes language {}

## Symbols, Strings, and Regular Expressions

• We use the same character (eg 0) to represent:
• a symbol (ie 0)
• a string (ie 0)
• a regular expression (ie 0)

• Normally clear from context which is meant

• Some books use bold fonts to distinguish

• ε is either the empty string or a RE representing the language containing only the empty string

• φ is either the empty set or a RE representing the empty language
• Which, of course, is simply the empty set

## Σ*

• If Σ is an alphabet, then we also use Σ as a regular expression
• Σ = s1 ∪ s2 ∪ ... ∪ sn, where s1, ..., sn are the symbols of alphabet Σ

• As a result, we use the symbol Σ to represent either of the following:
• an alphabet (ie a set of symbols)
• A regular expression that is the union of the REs of the alphabet symbols

• Example: For alphabet Σ = {a, b} [set of symbols]
• Regular Expression Σ = a ∪ b [RE ∪ RE]

• For Alphabet Σ, we let Σ* be a RE that represents all strings over the alphabet Σ

## Regular Expressions: Identities

• For any regular expression R:
• ε* = ?

• φ* = ?

• R ∪ ∅ = R

• R ∪ ε = ???

• Rε = R ⚬ ε = ?

• R∅ = R ⚬ ∅ = ???

• R+ ∪ ε = ???

## Regular Expressions: Formal Defintion

• R is a Regular Expression if R is
1. a for some a in the alphabet Σ
2. ε
3. (R1 ∪ R2), where R1 and R2 are regular expressions
4. (R1 ⚬ R2), where R1 and R2 are regular expressions
5. (R1*), where R1 is a regular expression

• Definition - For RE R, L(R) [ie the language described by R] is
1. L(a) = {a}, for a in alphabet Σ
2. L(ε) = {ε}
3. L(∅}) = {} = ∅ (ie language described by ∅ is the empty language)
4. L(R1 ∪ R2) = L(R1) ∪ L(R2)
5. L(R1 ⚬ R2)= L(R1) ⚬ L(R2)
6. L(R1*) = (L(R1))*

• This is a recursive/inductive definition

## Regular Expressions: More Examples

• What language is 1*(01*)*

• Example 1.53, page 65

• 1*φ*

• 0 ⚬ (0 ∪ 1)* ⚬ 0

• (a ∪ b) ⚬ a ⚬ (a ∪ b)* a ⚬ (a ∪ b)*

• Tokens (p. 66)

## Regular Expressions: Example Application

• Tokens (p. 66)

• Regular languages and Regular Expressions are heavily used in Programming Language Theory

• In a language, the set of all possible tokens is a RL
• A token is essentially a word of the language (eg numbers, identifiers, keywords, operators)

• Example: Numbers: digit followed by 0 or more digits (length?)
• Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
• FSM = ...

• Example: Identifiers: letter followed by 0 or more letters or digits
• Σ = {a, b, c, ..., y, z, 0, 1, ..., 9}
• FSM = ...

## Regular Expressions as a Type

• Values: All possible regular expressions (over some Σ)

• Operations: ∪, o, *
• More Operations: +, L(R)

## Regular Expressions Describe Regular Languages

• We prove that Regular Expressions describe the Regular Languages, and only the RL

• Theorem 1.54: A language is regular iff some RE describes it

• The proof has two parts:
1. Lemma 1.55: Every RE describes a RL
2. Lemma 1.56: Every RL can be described by a RE

• Together, Lemmas 1.55 and 1.60 imply Theorem 1.54

## Part 1: Languages Described by RE Are Regular

• Lemma 1.55: If a language is described by a RE, then it is regular

• Proof idea - Build a machine that accepts the language of a given RE:
• Definition clauses 1 - 3: construct simple machines

• Definition clauses 4 - 6: use same approach as when proving closure of regular operations

• Examples 1.56 and 1.58 (pp. 68-9)

• For clauses 4-6, proof uses induction on the length of the RE
• Base cases: ???

• IH: Assume a RE of length k (for an arbitrary k≥1) defines a regular language

• Now consider any RE of length k+1. It must be one of three forms: ...

## Part 2: Each Regular Language Can be Described by a RE

• Lemma 1.60: If a language is regular, then it is described by a RE

• Basic technique: Create a new kind of machine, a GNFA (Generalized NFA) that has the same power as a NFA and that is suited to proving Lemma 1.60

• Proof idea - For each RL, convert its NFA into an equivalent RE:
• Regular Language L has an NFA, call it N

• Convert N into a GNFA (Generalized NFA) called G that accepts L

• Transform this G to find a RE that describes L:
• ```        while G has 2 or more states loop
remove from G one state and all its transitions
modify other transitions so still G accepts L
end loop

-- G now has 2 states and 1 transition`
```
• G's single transition has a RE that describes N's language L

## Part 2: GNFA Transitions

• GNFA Transitions have RE that describe strings that take machine between states

• Language defined by machine is set of strings described by transitions on path from start to final states

• Example: transition between states 1 and 2 has RE 01*0

• Strings 00, 010, 0110, ... take machine from states 1 to 2
• Because the RE on the transition from 1 to 2 describes these strings

## Part 2: Simple Examples Modifying GNFA

• Example 1: transition from states 1 to 2 has a transition from states 2 to 3 has c. Now remove state 2.

• Example 2, like Example 1, but with a loop labeled b on 2

• Actual modifications are more complicated because ALL PATHS must be considered

• Example 3: Like Example 2, but add transition from states 1 to 3 labeled with d. Now remove state 2

## GNFA - Compared to NFA

• GNFA are based on NFA, but
• Transitions are labeled with REs rather than symbols
• Following a transition consumes a string, not just a single symbol
• (Almost) every pair of states has a transition
• δ specifies the RE on transition between states
• GNFA G accepts n string w if a path from initial to final states contains RE that describe substrings of w (in order, of course)

• Example 1.61, p. 70

## GNFA - Special Form and Computation

• A GNFA has a special form:
• Start state: no arrows in
• Accept state: only 1 accept state, no arrows out
• All other pairs of states have arrows
• Delta (ie δ) returns the RE between two states
• Delta does not map a state and an input symbol to a state
• Delta maps a pair of states to a Regular Expression
• δ: (Q-{qaccept}) × (Q-{qstart}) → R

• A computation moves between two states if the RE between those states describes the next substring of the input string

• A transition between two states consumes an input (sub)string described by the RE on that transition

• If multiple paths between two states, then different paths between these states will probably consume different input strings

## How to Convert an NFA to an Equivalent GNFA

• It's easy to convert an NFA N to an equivalent GNFA G:
• Add a new start state

• Add a new accept state

• Single arrows between states are left unchanged
• Symbols and ε on arrows become symbol and ε REs

• Multiple arrows between a pair of states are removed and replaced by a Union arrow

• Add ∅ arrows to G so that there are arrows between all pairs of states (except to the start state and from the final state)

• Example

## GNFA: Formal Definition

• A GNFA is a 5-tuple (Q, Σ, δ, qstart, qaccept)
1. Q is a finite set of states
2. Σ is the input alphabet
3. δ: (Q-{qaccept}) × {Q-{qstart}) → R is the set of transitions
4. (where R is the set of all possible RE over Σ)
5. qstart is the start state
6. qaccept is the accept state

• Computation with a GNFA:
• G accepts w = w1w2...wk for strings w1 ... wk if
• there is a computation that goes through states s0, s1, ..., sk
• and wi is in L(δ(si-1, si))

## GNFA: Removing States

• Basic process:
1. Remove a state and modify transitions so that language accepted does not change
2. Repeat 1 until only 2 states remain

• Algorithm: (See Figure 1.63, p. 72)
• Choose any state (except the start or accept state) as the state to remove. Call this state qrip.

• For all pairs of states (qi, qj) for which there is a pair of non-empty transitions R1 and R3, with R1 ∈ δ(qi, qrip) and R3 ∈ δ(qrip, qj):
• modify delta to change the transition from qi to qj
• That is, modify δ(qi, j)

## Removing States: Notes on Modifying Delta

• For strings that go through qrip to get from qi to qj, the modifications take those same strings directly from qi to qj

• The modifications do nothing to modify the set of strings that go from qi to qj without going through qrip
• Thus, removing qrip does not change the language accepted by the machine

• The direct transition from qi to qj may be φ before removing qrip

• qi and qj can be the same state

• Example 1: Figure 1.67, pg. 75
• Example 2: Figure 1.69, pg. 76

• ## GNFA: Procedure Convert

• Procedure Convert converts a GNFA into an equivalent Regular Expression (ie one that describes the language that the GNFA accepts)

• Convert: GNFA → RE

• Procedure Convert(G) formally defines the process of
• removing a state and
• modifying the resulting machine so that it it recognizes the same language

• For GNFA G, Convert(G) recursively calls Convert(G'), where GNFA G' is G with one state removed

## Proof of Claim 1.65

• Claim 1.65: For any GNFA G, L(Convert(G)) = L(G)
• That is, the language described by the RE returned by Convert(G) is the same as the language recognized by the machine G

• We use an inductive proof on k, number of states in G
• Basis: k = 2 states: When k=2, G has one transition. The RE on that transition describes all strings that take G from the initial state to the accept state. These strings make up L(G). Furthermore, Convert(G) returns the single RE on the single transtion of G. Thus, L(Convert(G)) is the set of strings described by the RE on G's transition. Thus L(Convert(G)) is equivalent to L(G) (ie the strings that take Convert(G) to the accept state are the strings generated by the RE on the single transition of G).

• Induction step:
1. Inductive Hypothesis: Assume Claim 1.65 is true for any machine that has k-1 states, for an arbitrary k > 2.
• That is, if Machine H has k-1 states, then L(Convert(H))=L(H).
• In words, if Machine H has k-1 states, then the RE produced by Convert(H) describes the same language as machine H recognizes

2. Now consider any machine G that has k states.

• First we show that L(G) = L(G'), where G' is the modified machine. [That is, we show that both machines recognize the same language]
1. We start by showing that if G accepts string w, then G' must also accept w (ie that L(G) ⊆ L(G'). Now, if G accepts w, then there are two possibilities: either G's computation on w does not contains qrip, or it does contain qrip:
1. If the computation on G does not contain qrip, then each transition on G in this computation can also be found on G', possibly as part of a union, and thus G' has a computation that accepts w.
2. If the computation on G does contain qrip, then for any run of qrips in the computation on G, the two states that bracket this run have a new RE in G' and this new RE takes strings between the two states in G).
Thus, L(G) ⊆ L(G') (ie if G accepts w, then G' accepts w)

2. Now we show that if G' accetps a string w, then G also accepts w (ie that L(G') ⊆ L(G). Suppose G' accepts input w. Then G accepts w because at any step in the computation in G', the RE on the transition between the pair of states in G'is constructed to describe the strings that go between the same two states in G. Thus, G must also accept w. Therefore L(G')L(G)

3. Thus, from 3.1 and 3.2 we conclude that L(G) = L(G'), that is G and G' accept the same set of strings.

• Now, by the construction of G', if G has k states, then G' has k-1 states, and by the Inductive Hypothesis we know that L(G') = L(Convert(G')).

• Finally, when Convert is written so that if G has more than 2 states, then Convert(G) returns Convert(G'). That is, by definition, Convert(G) = Convert(G') which means that L(Convert(G')) = L(Convert(G)).

• Thus, from the three points above we have
• L(G)=L(G')
• L(G') = L(Convert(G'))
• L(Convert(G')) = L(Convert(G))
Finally, we can conclude that L(G) = L(Convert(G)) , our desired result.

• We have shown that Claim 1.65 is true when G has 2 states. We have also shown that when we assume it is true when G has k-1 states, then we can show it is also true when G has k states. Thus, Claim 1.65 is true, by induction, for all GNFA that have 2 or more states. This is what we wanted to prove!.

## Conclusion: RE ↔ RL

• Remember: Convert: GNFA → RE

• We have now proved the following:

• Claim 1.65: L(G) = L(Convert(G)) for GNFA G

• Lemma 1.60: If a language is regular, then it is described by a RE
• This follows from Claim 1.65
• What is the chain of reasoning?

• Lemma 1.55: If a language is described by a RE, then it is regular
• Why?

• Thus we conclude:
• Theorem 1.54: A language is regular iff some RE describes it

ITEC 420 Course Page,