ITEC 420 - Chapter 2 - Context Free Languages
Section 2.1 - Context Free Grammars
Chapter 2 - Overview
 Two new language definition techniques:
 PDA = NFA + Stack
 CFG - Context Free Grammar
 PDA and CFG are equivalent in power
 New class of languages: CFL - Context Free Languages
 Every Regular Language is also a CFL
 Application area: Compilers:
 Use CFG to describe programming language syntax
 Compilers implement PDA to recognize syntax
Grammars Generate/Derive Strings
 Grammars provide a new way of defining a language
 L(G) represents the language defined by a grammar
 We say that a grammar generates (or derives) a string
Generating Strings - Derivations
 Grammars use a sequence of substitutions to derive a string
 Entire sequence of substitutions is called a derivation
 Each substitution replaces one symbol with one or more symbols
 Each substitution brings string closer to the final derived string
Defining a Grammar
 Define 4 things:
 Two sets of symbols:
 V - set of symbols that are substituted
 Σ - set of symbols that are found in strings in the language
 R - set of possible substitutions
 S - a designated symbol that is used to start the process
 Names:
 V - Variables (aka nonterminals)
 Σ - Terminals
 R - Rules (aka Productions)
 S - Start Symbol
Example Grammar
A → 0A1
A → B
B → #
Terminals: {0, 1, #}
Variables: {A, B}
Start Symbol: A
Example Derivation
A ⇒ 0A1
⇒ 00A11
⇒ 000A111
⇒ 000B111
⇒ 000#111
What language does this grammar define (ie what is L(G))?
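One way to check the answer is to simulate derivations in this grammar mechanically. A minimal sketch in Python (the helper name `derive` is invented here):

```python
# Grammar G1 from the slide: A -> 0A1 | B, B -> #
# Applying A -> 0A1 n times, then A -> B, then B -> # derives 0^n # 1^n.
def derive(n):
    s = "A"                            # start with the start symbol
    for _ in range(n):
        s = s.replace("A", "0A1", 1)   # substitution A -> 0A1
    s = s.replace("A", "B", 1)         # substitution A -> B
    return s.replace("B", "#", 1)      # substitution B -> #

print(derive(3))   # 000#111, matching the derivation above
```

Every derivation in G follows this shape, which suggests L(G) = {0^n # 1^n | n ≥ 0}.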
Another Example Grammar
S → AcB
A → aA
A → a
B → bB
B → b
Terminals: {...}
Variables: {...}
Start Symbol: ?
Derive aaacb:
S ⇒ AcB
⇒ ...
...
⇒ aaacb
What is L(G2)?
Example Grammar with ε
S → AB
A → aA
A → ε
B → bB
B → ε
Terminals: {...}
Variables: {...}
Start Symbol: ?
Some strings that it derives:
What is L(G3)?
Common Practice in Grammar Definitions
 Terminals are lower case
 Variables:
 Variables are upper case
 Sometimes distinguish with different fonts
 Sometimes use angle brackets around variables (ie <NOUN>)
 Start Symbol:
 Usually taken to be the first one listed, if not otherwise specified
 Frequently actually is the symbol S
Parse Tree
 Pictorial representation of a derivation
 Root is start symbol
 Children of nodes are RHS of substitution rules
 Example: 00#11
 Nodes and grammars:
 Leaves are ...
 Internal nodes are ...
 Connection between parse tree and derivation
 Every step in derivation corresponds to subtree
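One way to make the tree/derivation connection concrete: represent the parse tree for 00#11 as nested tuples and read off its leaves. A sketch (this tuple encoding is mine, not from the slides):

```python
# Parse tree for 00#11 under A -> 0A1 | B, B -> #:
# each node is (label, child, child, ...); leaves are terminal strings.
tree = ("A", "0", ("A", "0", ("A", ("B", "#")), "1"), "1")

def leaves(t):
    """Concatenate the leaves left to right: this is the derived string."""
    if isinstance(t, str):
        return t
    return "".join(leaves(child) for child in t[1:])

print(leaves(tree))   # 00#11
```

Each internal node corresponds to one substitution step in the derivation.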
Context Free Grammar: Definition
 A CF Grammar G is a 4-tuple (V, Σ, R, S) such that
 V and Σ are finite sets of symbols
 V ∩ Σ = {}
 In other words, the variables and terminals have no symbols in common
 Substitution rules have:
 Single variable on left
 String (of variables and terminals) on right
 S ∈ V
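The 4-tuple definition translates directly into data. A sketch using the earlier example grammar (the field names are mine):

```python
# G = (V, Sigma, R, S) for the grammar A -> 0A1 | B, B -> #
G = {
    "V": {"A", "B"},                       # variables
    "Sigma": {"0", "1", "#"},              # terminals
    "R": {"A": ["0A1", "B"], "B": ["#"]},  # rules: variable -> list of RHS strings
    "S": "A",                              # start symbol
}

# The conditions from the definition:
assert G["V"].isdisjoint(G["Sigma"])         # V ∩ Σ = {}
assert G["S"] in G["V"]                      # S ∈ V
assert all(lhs in G["V"] for lhs in G["R"])  # single variable on each left side
print("well-formed CFG")
```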
Context Free Language: Definition
 A CFL is a language that can be generated by a CFG
 If G1 is a CFG then L(G1) is a CFL
Non Context Free Grammar
 Hmmm, are there grammars that are not Context Free?
 Yes: Allow arbitrary strings on the left side of rules
 Example: a^{n}b^{n}c^{n} is not a CFL. A grammar for it contains these rules:
...
BA → AB
bA → Ab
...
Such grammars are called Context Sensitive
Some Grammar Shorthands
 Grammar G1 - original version:
A → 0A1
A → B
B → #
Grammar G1:
A → 0A1
  → B
B → #
Grammar G1:
A → 0A1
  | B
B → #
Grammar G1:
A → 0A1 | B
B → #
Some Example Languages
 What grammar generates these CFL?
 a^{n}b^{n}, n > 0
 a^{n}b^{n}, n ≥ 0
 0*
 0*1*
 (0 ∪ 1)*
CFL for Regular Languages
 We can build a grammar that generates a Regular Language:
 Convert Regular Expression to a grammar
 Convert D/NFA to grammar
 Regular Expression:
 Example: (a ∪ b)(c ∪ d)*
 Union: S → A | B, where A and B derive ...
 Star: S → XS | ε, where X derives ...
 Concatenation: S → AB, where A and B derive ...
...
 Convert D/NFA to CFG
 R_{i} → aR_{j} if a takes the DFA from q_{i} to q_{j}
 Add rule R_{i} → ε (or R_{i} → t) for accept states
 Start state gives start symbol
 Every Regular Language can be expressed using a Regular Grammar
 In a Regular Grammar, all rules have the form V → t | tU
 Alternatively: of the form V → w | wT, where w is a string of terminals
 Regular Grammars are not in book and we mostly
ignore them
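The D/NFA-to-grammar construction above is mechanical enough to code. A sketch for a small hypothetical DFA over {a, b} accepting strings that end in b (the state names q0, q1 are mine):

```python
# delta maps (state, symbol) -> state; q1 is the only accept state.
delta = {("q0", "a"): "q0", ("q0", "b"): "q1",
         ("q1", "a"): "q0", ("q1", "b"): "q1"}
accepting = {"q1"}

rules = []
for (qi, sym), qj in sorted(delta.items()):
    rules.append((f"R{qi}", sym + f"R{qj}"))   # R_i -> a R_j when delta(q_i, a) = q_j
for q in sorted(accepting):
    rules.append((f"R{q}", "ε"))               # R_i -> ε for each accept state

for lhs, rhs in rules:                          # start symbol: Rq0, from the start state
    print(lhs, "->", rhs)
```

Each rule tracks exactly one DFA transition, so derivations of this grammar correspond one-to-one with accepting runs of the DFA.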
Expression Grammar
 Consider arithmetic expressions of the form a + b * c - a
 Can we express this with a RE?
 Does it express precedence and associativity?
 A precedence grammar for arithmetic expressions
 Example: grammar for a * b + c * d + e
 First, generate T + T + T
 Then expand 2 of the T's into F * F
 Then expand the T's or F's into a, b, c, d, e
 Precedence Grammar:
 E → E + T | T
 T → T * F | F
 F → a | b | c | d
 How do we add parentheses?
 Consider a * b + c vs a * (b + c)
 Where do the parenthesized expressions fit in?
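The E/T/F layering is what makes * bind tighter than +. A sketch of an evaluator whose call structure mirrors the grammar, with F extended to numbers and parentheses (one answer to the question above), and the left-recursive rules realized as loops, which derive the same strings:

```python
def evaluate(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def expr():            # E -> E + T | T, as a loop: T { + T }
        nonlocal pos
        v = term()
        while peek() == "+":
            pos += 1
            v += term()
        return v
    def term():            # T -> T * F | F, as a loop: F { * F }
        nonlocal pos
        v = factor()
        while peek() == "*":
            pos += 1
            v *= factor()
        return v
    def factor():          # F -> ( E ) | number
        nonlocal pos
        if peek() == "(":
            pos += 1       # consume "("
            v = expr()
            pos += 1       # consume ")"
            return v
        v = int(tokens[pos])
        pos += 1
        return v
    return expr()

print(evaluate(["2", "*", "3", "+", "4"]))            # 10: * groups first
print(evaluate(["2", "*", "(", "3", "+", "4", ")"]))  # 14: parens override
```

Because + is handled in E, above T, every * subtree is complete before a + combines results: precedence falls out of the grammar's shape.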
Other Tips on Defining CFL
 Think about what happens to the string when a substitution is made (eg string length and number of
variables and terminals)
 Always remember to consider what a given variable
derives
 To remember information (ie link size of one substring
to another): R → uRv
 Remember recursion:
 Direct recursion (eg S → aS)
 Indirect Recursion: eg F → (E)
 Building a Total Language Tree can help understand the strings produced by a grammar:
 Nodes contain strings of terminals and/or variables
 Root is Start symbol
 For each node the children are all possible strings that
result when substituting all possible RHS for every variable, one at a time
 Leaves are strings of terminals derived from the start symbol
 Tree may be infinite
 Example: Expression grammar (top of the tree):
E
  E+T
    E+T+T   T+T   E+T*F   E+F
  T
    T*F   F
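A Total Language Tree can be generated level by level. A sketch for the expression grammar, with F shortened to F → a to keep the output small:

```python
rules = {"E": ["E+T", "T"], "T": ["T*F", "F"], "F": ["a"]}

def children(s):
    """All strings obtained by one substitution at one variable occurrence."""
    out = []
    for i, ch in enumerate(s):
        for rhs in rules.get(ch, []):
            out.append(s[:i] + rhs + s[i + 1:])
    return out

def nodes(s, depth):
    """All tree nodes reachable from s in at most `depth` substitutions."""
    found = {s}
    if depth > 0:
        for c in children(s):
            found |= nodes(c, depth - 1)
    return found

print(sorted(nodes("E", 2)))
# the first two levels: E, E+T, T, then E+T+T, T+T, E+T*F, E+F, T*F, F
```

Since the tree is infinite, any exploration must be cut off by depth (as here) or by node count.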
CFG Terminology: yields, derives, language of a grammar
 Yields
 A string in a derivation yields the next
 uAv yields uwv, written uAv ⇒ uwv, for strings u, v, w of terminals and variables and rule A → w
 Derives
 A string in a derivation derives all that follow
 u derives v, written u ⇒* v, if u = v or u ⇒ u_{1} ⇒ ... ⇒ u_{k} ⇒ v for k ≥ 0
 The language of the grammar is {w ∈ Σ* | S ⇒* w}
Leftmost and Rightmost Derivations
 A leftmost derivation always replaces the leftmost variable
 A rightmost derivation always replaces the
rightmost variable
 Why do we care?
 No difference in strings that a grammar derives with
left/right most derivations
 Left/rightmost may use different productions (and
thus different parse tree and structure)
 In real compilers, leftmost are tied to top down parsers,
rightmost to bottom up parsers
Leftmost Derivations and Ambiguity
 A grammar is ambiguous if it generates the same string
in two different leftmost derivations
 Example - Ambiguous expression grammar:
 E → E + E | E * E | a
 Consider a + a + a
 Equivalent definition:
A grammar is ambiguous if it generates
the same string in two different parse trees
 Some programming languages have ambiguous grammars (eg x*y in C, dangling
else).
 Parsing ambiguous grammars directly would require backtracking,
which is slow
 Compilers can be built with additional rules to disambiguate
 Is there an algorithm to determine if a grammar is ambiguous?
 No, this problem is undecidable
 That is, we can prove that there is no such algorithm
 Not even for context free grammars
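Although ambiguity is undecidable in general, for a fixed string we can count its parse trees directly. A sketch for the grammar E → E + E | E * E | a, trying each operator in turn as the root of the tree:

```python
from functools import lru_cache

def count_trees(tokens):
    """Number of parse trees deriving `tokens` under E -> E+E | E*E | a."""
    tokens = tuple(tokens)
    @lru_cache(maxsize=None)
    def count(lo, hi):                   # trees deriving tokens[lo:hi]
        if hi - lo == 1:
            return 1 if tokens[lo] == "a" else 0
        total = 0
        for k in range(lo + 1, hi - 1):  # choose tokens[k] as the root operator
            if tokens[k] in "+*":
                total += count(lo, k) * count(k + 1, hi)
        return total
    return count(0, len(tokens))

print(count_trees("a+a+a"))    # 2: (a+a)+a and a+(a+a)
print(count_trees("a+a+a+a"))  # 5 (the Catalan number C_3)
```

A count greater than 1 for any string witnesses ambiguity of this grammar; a + a + a has exactly the two trees discussed above.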
Inherent Ambiguity
 Some languages can be generated both by an ambiguous grammar and by an unambiguous grammar
 For such languages, ambiguity is a property of the particular grammar, not of the language
 Example: ...
 An inherently ambiguous language is generated only
by ambiguous grammars

In other words, ALL grammars that generate an inherently ambiguous language are ambiguous.
 {a^{i}b^{j}c^{k} | i = j or j = k} is inherently ambiguous
 What would a context free grammar for this language look like?
Chomsky Normal Form (CNF)
 Sometimes it helps to have a standard way of writing grammars
 This is called a Normal Form
 "Normal" means "standard" in this context
 We use Chomsky Normal Form (CNF)
 We will define an algorithm for converting any CFG to CNF
 A grammar is in CNF if every rule is in one of these forms:
 A → BC
 A → a
 where
 A, B and C are any variables
 except B and C may not be the start variable (ie the start symbol cannot appear on any RHS)
 a is any single terminal symbol
 We also allow S → ε, where S is the start symbol
Chomsky Normal Form - Example
 The following grammar is in CNF
 S → AB
 A → AB
 A → a
 B → BA
 B → b
 What language does this describe?
Chomsky Normal Form - Motivation
 CNF simplifies defining algorithms for working with grammars
 In particular, we will use it when we prove the equivalence of
CFG and PDA
 Other Normal Forms are defined and used for other purposes
Conversion to CNF
 Theorem 2.9: Any CFL is generated by a CFG in CNF
 Theorem 2.9 Proof by construction: Algorithm to convert any CFG to CNF
 Idea underlying algorithm: Eliminate rules that violate CNF and repair grammar so that it
generates the same language
 Add a new start symbol: S_{0} → S (so that start symbol is not on any RHS)
 Remove rules of form A → ε and repair affected rules
to keep language the same
 Handle all unit rules:
 Remove rules of form A → B
 Repair by adding A → u, for B → u,
where u is a string of variables and terminals
 unless B → u is a unit rule that was already
removed
 Convert remaining rules (eg A → BCD and A → aB) in the obvious way
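The last step of the algorithm can be sketched as follows; the fresh variable names X, Y, ... are invented here and assumed not to clash with existing variables:

```python
def to_binary(rules):
    """Rewrite rules so every RHS of length >= 2 is a pair of variables."""
    new_rules, fresh = [], iter("XYZUVW")   # supply of fresh variable names
    term_var = {}                           # terminal -> its stand-in variable
    for lhs, rhs in rules:                  # rhs is a list of symbols
        if len(rhs) >= 2:                   # replace each terminal a by U_a, U_a -> a
            rhs2 = []
            for sym in rhs:
                if sym.islower():
                    if sym not in term_var:
                        term_var[sym] = next(fresh)
                        new_rules.append((term_var[sym], [sym]))
                    sym = term_var[sym]
                rhs2.append(sym)
            rhs = rhs2
        while len(rhs) > 2:                 # split A -> B C D into A -> B X, X -> C D
            v = next(fresh)
            new_rules.append((lhs, [rhs[0], v]))
            lhs, rhs = v, rhs[1:]
        new_rules.append((lhs, rhs))
    return new_rules

for lhs, rhs in to_binary([("A", ["B", "C", "D"]), ("A", ["a", "B"])]):
    print(lhs, "->", " ".join(rhs))
```

Each transformation preserves the derived strings, so the resulting grammar generates the same language.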
Conversion to CNF - Examples
 Consider a^{*}b^{*}:
 Grammar 1:
 Grammar 2:
 S → AB
 A → aA | ε
 B → bB | ε
 Grammar 3:
 S → AB | ε | A | B
 A → aA | a
 B → bB | b
 Consider 0^{n}1^{n}
 Example 2.10 (page 108)
Some Undecidable Problems for CFL and CFG
 Assume L and M are Context Free Languages
 Assume G and H are Context Free Grammars
 The following are undecidable (ie no algorithm exists):
 L ∩ M = ∅
 L^{c} = ∅
 L^{c} is a Context Free Language
 L is a regular language
 L can be recognized by a deterministic PDA
 G is ambiguous
 G generates an inherently ambiguous language
 L(G) = L(H)
 These are presented for interest only; we won't study them further.
 They can be proved using the Post Correspondence Problem
(which unfortunately we don't have time for).