ITEC 420 - Chapter 2 - Context Free Languages
Section 2.1 - Context Free Grammars
Chapter 2 - Overview
- Two new language definition techniques:
- PDA = NFA + Stack
- CFG - Context Free Grammar
- PDA and CFG are equivalent in power
- New class of languages: CFL - Context Free Languages
- Every Regular Language is also a CFL
- Application area: Compilers:
- Use CFG to describe programming language syntax
- Compilers implement PDA to recognize syntax
Grammars Generate/Derive Strings
- Grammars provide a new way of defining a language
- L(G) represents the language defined by a grammar
- We say that a grammar generates (or derives) a string
Generating Strings - Derivations
- Grammars use a sequence of substitutions to derive a string
- Entire sequence of substitutions is called a derivation
- Each substitution replaces one symbol with one or more symbols
- Each substitution brings string closer to the final derived string
Defining a Grammar
- Define 4 things:
- Two sets of symbols:
- V - set of symbols that are substituted
- Σ - set of symbols that are found in strings in the language
- R - Set of possible substitutions
- S - A designated symbol that is used to start the process
- Names:
- V - Variables (aka non-terminals)
- Σ - Terminals
- R - Rules (aka Productions)
- S - Start Symbol
Example Grammar
A → 0A1
A → B
B → #
Terminals: {0, 1, #
}
Variables: {A, B
}
Start Symbol: A
Example Derivation
A ⇒ 0A1
⇒ 00A11
⇒ 000A111
⇒ 000B111
⇒ 000#111
What language does this grammar define (ie what is L(G))?
Another Example Grammar
S → AcB
A → aA
A → a
B → bB
B → b
Terminals: {...}
Variables: {...}
Start Symbol: ?
Derive aaacb:
S ⇒ AcB
⇒ ...
...
⇒ aaacb
What is L(G2)
Example Grammar with ε
S → AB
A → aA
A → ε
B → bB
B → ε
Terminals: {...}
Variables: {...}
Start Symbol: ?
Some strings that it derives:
What is L(G3)
Common Practice in Grammar Definitions
- Terminals are lower case
- Variables:
- Variables are upper case
- Sometimes distinguish with different fonts
- Sometimes use angle brackets around variables (ie <NOUN>)
- Start Symbol:
- Usually taken to be the first one listed, if not otherwise specified
- Frequently actually is the symbol S
Parse Tree
- Pictoral representation of derivation
- Root is start symbol
- Children of nodes are RHS of substitution rules
- Example: 00#11
- Nodes and grammars:
- Leaves are ...
- Internal nodes are ...
- Connection between parse tree and derivation
- Every step in derivation corresponds to subtree
Context Free Grammar: Definition
- A CF Grammar G is a 4-tuple: (V, Σ R, S) such that
- V and Σ are finite sets of symbols
- V ∩ Σ = {}
- In other words, the variables and terminals have no symbols in common
- Substitution rules have:
- Single variable on left
- String (of variables and terminals) on right
- S ∈ V
Context Free Language: Definition
- A CFL is a language that can be generated by a CFG
- If G1 is a CFG then L(G1) is a CFL
Non Context Free Grammar
- Hmmm, are there grammars that are not Context Free?
- Yes: Allow arbitrary strings on the left side of rules
- Example: anbnncnn is not a CFL. A grammar for it
contains these rules:
...
BA → AB
bA → Ab
...
Such grammars are called Context Sensitive
Some Grammar Shorthands
- Grammar G1 - original version:
A → 0A1
A → B
B → #
Grammar G1:
A → 0A1
→ B
B → #
Grammar G1:
A → 0A1
| B
B → #
Grammar G1:
A → 0A1 | B
B → #
Some Example Languages
- What grammar generates these CFL?
- anbn, n > 0
- anbn, n ≥ 0
- 0*
- 0*1*
- (0 ∪ 1)*
CFL for Regular Languages
- We can build a grammar that generates a Regular
languages:
- Convert Regular Expression to a grammar
- Convert D/NFA to grammar
- Regular Expression:
- Example: (a U b)(c U d)*
- Union: S → A | B, where A and B derive ...
- Star: S → XS | ε, where X derives
- Concatenation: S → AB, where A and B derive ...
...
- Convert D/NFA to CFG
- Ri → aRj if a takes DFA
from qi to qj
- Add rule Ri → ε (or
Ri → t) for accept states
- Start state gives start symbol
- Every Regular Language can be expressed using a Regular Grammar
- In a Regular Grammar, all rules have the form V → t | tU
- Alternatively: of the form V → w | wT, where w is
a string of terminals
- Regular Grammars are not in book and we mostly
ignore them
Expression Grammar
- Consider arithmetic expressions of the form a + b * c - a
- Can we express this with a RE?
- Does it express precedence and associativity?
- A precedence grammar for arithmetic expressions
- Example: grammar for a * b + c * d + e
- First, generate T + T + T
- Then expand 2 of the T's into F * F
- Then expand T's or F's into a, b, c, d
- Precedence Grammar:
- E → E + T | T
- T → T * F | F
- F → a | b | c | d
- How do we add parentheses?
- Consider a * b + c vs a * (b + c)
- Where do the parenthesized expressions fit in?
Other Tips on Defining CFL
- Think about what happens to the string when a substitution is made (eg string length and number of
variables and terminals)
- Always remember to consider what a given variable
derives
- To remember information (ie link size of one substring
to another): R → uRv
- Remember recursion:
- Direct recursion (eg S → aS)
- Indirect Recursion: eg F → (E)
- Building a Total Language Tree can help understand strings produce by a grammar:
- Nodes contain strings of terminals and/or variables
- Root is Start symbol
- For each node the children are all possible strings that
result when substituting all possible RHS for every variable, one at a time
- Leaves are strings from the start symbol
- Tree may be infinite
- Example: Expression grammar:
E
|
------------------
/ \
E+T T
| |
------------ -----
/ / \ \ / \
E+T+T T+T E+T*F E+F T*F F
CFG Terminology: yields, derives, language of a
grammar
- Yields
- A string in a derivation yields the next
- uAv yields uwv, written uAv ⇒
uwv, for strings u,v,w of
terminials and variables and rule A → w
- Derives
- A string in a derivation derives all that follow
- u derives v , written u ⇒* v,
if u=v or u ⇒ u1 ⇒ ... ⇒
uk ⇒ v
for k ≥ 0.
- The language of the grammar is
{w ∈ Σ* | S ⇒* w}
Leftmost and Rightmost Derivations
- A leftmost derivation always replaces the leftmost variable
- A rightmost derivation always replaces the
rightmost variable
- Why do we care?
- No difference in strings that a grammar derives with
left/right most derivations
- Left/rightmost may use different productions (and
thus different parse tree and structure)
- In real compilers, leftmost are tied to top down parsers,
rightmost to bottom up parsers
Leftmost Derivations and Ambiguity
- A grammar is ambiguous if it generates the same string
in two different leftmost derivations
- Example - Ambigous expression grammar:
- E → E + E | E * E | a
- Consider a + a + a
- Equivalent definition:
A grammar is ambiguous if it generates
the same string in two different parse trees
- Some programming languages have ambiguous grammars (eg x*y in C, dangling
else).
- Parsing ambiguous grammars directly would require backtracking,
which is slow
- Compilers can be built with additional rules to disambiguate
- Is there an algorithm to determine if a grammar is
ambiguous:
- No, this problem is undecidable
- That is, we can prove that there is no such algorithm
- Not even for context free grammars
Inherent Ambiguity
- Some languages can be generated by an ambiguous grammar
and by an unambiguous grammar
- In other words, there are languages that are generated by an
ambiguous grammar and by an unambiguous grammar
- Example: ...
- An inherently ambiguous language is generated only
by ambiguous grammars
-
In other words, ALL grammars that generate an inherently abiguous
language are ambiguous.
- {aibjck | i =
j or j = k} is inherently ambiguous
- What would a context free grammar for this language look like?
Chomsky Normal Form (CNF)
- Sometimes it helps to have a standard way of writing grammars
- This is called a Normal Form
- "Normal" means "standard" in this context
- We use Chomsky Normal Form (CNF)
- We will define an algorithm for converting any CFG to CNF
- Noam Chomsky
- A grammar is in CNF if every rule is in one of these forms:
where
- A, B and C are any variables
- except B and C may not be the start variable,
(ie start symbol cannot be on RHS)
- α is any single terminal symbol
- α can be ε only when A is the start symbol
- We also allow S → ε where S is the start symbol
Chomsky Normal Form - Example
- The following grammar is in CNF
- S → AB
- A → AB
- A →
a
- B → BA
- B →
b
- What language does this describe?
Chomsky Normal Form - Motivation
- CNF simplifies defining algorithms for working with grammars
- In particular, we will use it when we prove the equivalence of
CFG and PDA
- Other Normal Forms are defined and and used for other purposes
Conversion to CNF
- Theorem 2.9: Any CFL is generated by a CFG in CNF
- Theorem 2.9 Proof by construction: Algorithm to convert any CFG to CNF
- Idea underlying algorithm: Eliminate rules that violate CNF and repair grammar so that it
generates the same language
- Add a new start symbol: S0 → S (so that start symbol is not on any RHS)
- Remove rules of form A → ε and repair affected rules
to keep language the same
- Handle all unit rules:
- Remove rules of form A → B
- Repair by adding A → u, for B → u,
where u is a string of variables and terminals
- unless B → u is a unit rule that was already
removed
- Convert remaining rules (eg A → BCD and A → aB) in
obvious way
Conversion to CNF - Examples
- Consider 0*1*:
- Grammar 1:
- Grammar 2:
- S → AB
- A → aA | ε
- B → bB | ε
- Grammar 3:
- S → AB | ε | A | B
- A → aA | a
- B → bB | b
- Consider 0n1n
- Example 2.10 (page 108)
Some Undecidable Problems for CFL and CFG
- Assume L and M are Context Free Languages
- Assume G and H are Context Free Grammars
- The following are undecidable (ie no algorithm exists):
- L ∩ M = φ
- Lc = φ
- Lc is a Context Free Language
- L is a regular language
- L can be recognized by a deterministic PDA
- G is ambiguous
- G generates an inherently ambiguous language
- L(G) = L(H)
- These are presented for interest only; we won't study them further.
- They can be proved using the Post Correspondence Problem
(which unfortunately we don't have time for).