Overview
- Syntax of programming languages
- Grammars
- Compilers and interpreters
Example Grammar
- Example grammar for (very simple) English sentences (without punctuation):
- Nouns: either cat or dog. Represent with a rule like this:
<noun> ::= cat | dog
- Verbs:
<verb> ::= saw | chased
- Articles:
<article> ::= a | the
- Noun phrases:
<nounPhrase> ::= <article> <noun>
- Sentences:
<sentence> ::= <nounPhrase> <verb> <nounPhrase>
Grammars and Languages
- How do we use a grammar to determine if a sentence is correct in a
given language?
- A sentence is correct in a language if a grammar for the language can be used to derive the sentence
(ie if we can use the grammar to make a derivation of the sentence).
Derivations
- A derivation shows how a sentence can be created using a grammar
- Start a derivation with the symbol <sentence>
- At each step of the derivation, replace
a symbol with the right side of the corresponding rule.
- Use the double arrow to show a step in a derivation.
- Example: Derive "the cat saw the dog"
<sentence> ==> <nounPhrase> <verb> <nounPhrase>
==> <article> <noun> <verb> <nounPhrase>
==> the <noun> <verb> <nounPhrase>
==> the cat <verb> <nounPhrase>
==> the cat saw <nounPhrase>
==> the cat saw <article> <noun>
==> the cat saw the <noun>
==> the cat saw the dog
The sentence is a valid sentence in the language.
- Why? Because we could derive it
The sentence "cat saw dog" cannot be derived. It is not in the
language.
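The derivation above can be reproduced mechanically. The following is a minimal sketch, not a definitive implementation: the dictionary layout and the function name leftmost_derivation are assumptions made for illustration. It always replaces the leftmost nonterminal and backtracks when the words already produced stop matching the target sentence.

```python
# Example grammar for simple English sentences (dict layout is an assumption).
GRAMMAR = {
    "<sentence>": [["<nounPhrase>", "<verb>", "<nounPhrase>"]],
    "<nounPhrase>": [["<article>", "<noun>"]],
    "<article>": [["a"], ["the"]],
    "<noun>": [["cat"], ["dog"]],
    "<verb>": [["saw"], ["chased"]],
}


def leftmost_derivation(target, form=("<sentence>",)):
    """Return the sentential forms that derive target, or None if none exist."""
    words = target.split()
    # Locate the leftmost nonterminal in the current sentential form.
    for i, symbol in enumerate(form):
        if symbol in GRAMMAR:
            break
    else:
        # Only terminals remain: the derivation succeeds if they match.
        return [form] if list(form) == words else None
    # Prune: terminals already produced must match, and no rule shrinks a form.
    if list(form[:i]) != words[:i] or len(form) > len(words):
        return None
    for rhs in GRAMMAR[symbol]:
        new_form = form[:i] + tuple(rhs) + form[i + 1:]
        rest = leftmost_derivation(target, new_form)
        if rest is not None:
            return [form] + rest
    return None


if __name__ == "__main__":
    for step in leftmost_derivation("the cat saw the dog") or []:
        print("==>", " ".join(step))
    print(leftmost_derivation("cat saw dog"))  # no derivation exists
```

Calling the function on "cat saw dog" returns None, which matches the claim that this sentence is not in the language.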
Parts of a Grammar
- A grammar has 4 parts:
- Terminals: cat, dog, the, ...
- Variables (aka Non-terminals): <noun>, <verb>, <article>,
<nounPhrase>, <sentence>
- Substitution Rules (aka Productions): See example
- Start Symbol: Special variable used to start derivations (eg <sentence>)
- Non-terminals represent parts of the language. They can be substituted for in
a derivation
- Terminals are the words of the language. They cannot be substituted for in
a derivation
- Start Symbol: typically the most general symbol
- Memorize these terms!
BNF
- A grammar is in BNF (Backus Naur Form) if each rule has
exactly one nonterminal on the left hand side
- BNF is an essential technique for programming language specification
- Invented by John Backus and Peter Naur
- Used to define Algol 60
Alternative Representation for Grammars
- To save writing, we use uppercase for nonterminals,
lower for terminals, and single arrows for rules
- Example:
N → cat | dog
- Must be careful that you don't confuse terminals and nonterminals.
- Sometimes we write alternatives using multiple productions
N --> cat
N --> dog
Example 2
<article> ::= a
<article> ::= the
Defining a Language
- A language is a (possibly infinite) set of sentences.
- A grammar defines a language: the set of all sentences that can be
derived using a grammar.
- Example - the language defined by the grammar above:
{"the cat saw the dog", "a cat saw the dog", "the dog saw the dog",
"a dog saw the dog", ...}
- Languages that have an infinite number of strings are covered below.
Parse Trees
- Derivations - the down side: can be repetitious and hard to follow.
- Solution: Represent the substitutions of a derivation
graphically with a parse tree.
- Root is the start symbol
- Internal nodes are Variables
- Children of a node are the right sides of a production in a substitution
- Leaves are terminals
- Example: parse tree for the cat saw a dog
- A sentence is in a language if there is a parse tree of the
sentence.
Simple Languages
- Use simple languages to gain practice with grammars
- Use letters rather than words
- Don't put blanks between words
- Example:
S --> AB | ABC
A --> a | aa
B --> b | bb
C --> c
What are the parts of this grammar?
- Terminals: a, b, and c
- Variables
- Start symbol
- Thus, aa and bb are both words, and aabb is a sentence, written without blanks between words
Find the derivation and parse tree of abbc
(how many words in this sentence?)
Language defined by this grammar: {ab, aab, abb, abc, aabb, aabc, abbc, aabbc}
Description of this language in English:
- All strings that start with 1 or 2 a's, followed by 1 or 2 b's, followed by 0 or 1 c.
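One way to check that the English description matches the grammar is to compute both sets and compare them. The sketch below is illustrative only: the dictionary layout, the helper name language, and the regular expression used to encode the English description are assumptions, and the recursive expansion works only because this grammar has no recursion.

```python
import re
from itertools import product

# The simple grammar above, one string per alternative (layout is an assumption).
GRAMMAR = {"S": ["AB", "ABC"], "A": ["a", "aa"], "B": ["b", "bb"], "C": ["c"]}


def language(symbol):
    """Return every terminal string derivable from symbol (finite grammars only)."""
    if symbol not in GRAMMAR:
        return {symbol}  # a terminal stands for itself
    strings = set()
    for rhs in GRAMMAR[symbol]:
        # Combine every choice for each symbol on the right side.
        for parts in product(*(language(s) for s in rhs)):
            strings.add("".join(parts))
    return strings


generated = language("S")

# English description: 1 or 2 a's, then 1 or 2 b's, then 0 or 1 c.
pattern = re.compile(r"a{1,2}b{1,2}c?")
described = {
    "".join(letters)
    for length in range(1, 6)  # the longest sentence has 5 letters
    for letters in product("abc", repeat=length)
    if pattern.fullmatch("".join(letters))
}

if __name__ == "__main__":
    print(sorted(generated))
    print(generated == described)
```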
Sentences with Arbitrary Length
- The languages we've seen so far have been a finite set of sentences.
- The languages we've seen so far have had sentences with some maximum length.
- How do we make infinite sets of sentences which can be of any length?
- In an infinite set of sentences there is not a maximum sentence length
- Use your favorite tool, of course: Recursion!
A Recursive Grammar
- Example - What strings are generated by this grammar?
S --> aS | a
Notation for Representing Languages
- The language generated by
S --> aS | a
is
{a, aa, aaa, aaaa, ...}
- We call this language the set of all strings of one or more a's
- A shorthand notation for this language is a^n, n > 0, where a^n means a string of n a's
- For example, a^3 means aaa.
- a^0 represents an empty string
- Example: a^m b^n, m ≥ 0, n > 0 - what language does this represent?
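To make the link between the recursive rule and the exponent concrete, here is a small sketch (the function name derive_a_string is hypothetical). It applies S --> aS exactly n - 1 times and then finishes with S --> a, recording each sentential form along the way.

```python
def derive_a_string(n):
    """Derive the string of n a's from S --> aS | a; return every sentential form."""
    if n < 1:
        raise ValueError("the grammar cannot derive the empty string")
    form = "S"
    steps = [form]
    for _ in range(n - 1):
        form = form.replace("S", "aS")  # apply the recursive rule S --> aS
        steps.append(form)
    form = form.replace("S", "a")       # finish with the base rule S --> a
    steps.append(form)
    return steps


if __name__ == "__main__":
    print(" ==> ".join(derive_a_string(3)))
```

Each use of the recursive rule adds one a, so a derivation with n steps ends in a string of n a's.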
Showing that a Grammar Generates a Language
- There are two parts to a proof that a grammar generates a
particular language:
- You must show that every sentence in the language can be
generated by the grammar
- You must show that every sentence generated by the grammar
is in the language
- For example, what must you do to show that the grammar
S --> aS | a
generates the language
a^n, n > 0:
More Recursive Grammars
- Example:
S --> abS | ab
- Example:
S --> aSb | ab
- Example:
S --> AB
A --> aA | a
B --> bB | b
One More Language
- Can we generate a^n b^n c^n, n > 0? Try this:
L --> aLbLc | ab | bc
Context Sensitive Grammar
- The language a^n b^n c^n, n > 0, cannot be generated by a BNF grammar.
- A grammar for the language a^n b^n a^n, n > 0, is
S --> aSBA
S --> abA
AB --> BA
bB --> bb
bA --> ba
aA --> aa
- This grammar is NOT a BNF grammar.
- We call this grammar a context sensitive grammar,
- Some substitutions can be made only if the nonterminal has the right characters around it.
- That is, the nonterminal is in the correct context.
- For example, a B can be replaced by a b only if it has a b to its left (ie bB --> bb).
- When can an A be replaced by an a?
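The context sensitive rules can be tried out as plain string rewriting. The sketch below is an illustration under stated assumptions: the rule list encodes the six productions above, the function name derive is hypothetical, and the breadth-first search relies on the fact that no rule in this grammar makes a sentential form shorter.

```python
from collections import deque

# The six productions of the context sensitive grammar, as (left, right) pairs.
RULES = [
    ("S", "aSBA"), ("S", "abA"), ("AB", "BA"),
    ("bB", "bb"), ("bA", "ba"), ("aA", "aa"),
]


def derive(target, start="S"):
    """Return a shortest derivation of target as a list of forms, or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            path = []
            while form is not None:  # walk back to the start symbol
                path.append(form)
                form = parent[form]
            return path[::-1]
        for left, right in RULES:
            pos = form.find(left)
            while pos != -1:
                new_form = form[:pos] + right + form[pos + len(left):]
                # Forms never shrink, so anything longer than target is useless.
                if len(new_form) <= len(target) and new_form not in parent:
                    parent[new_form] = form
                    queue.append(new_form)
                pos = form.find(left, pos + 1)
    return None


if __name__ == "__main__":
    print(" ==> ".join(derive("aabbaa")))
    print(derive("aabba"))  # not of the form a^n b^n a^n
```

A rule such as bA --> ba only fires where an A has a b immediately to its left, which is exactly the context restriction described above.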
Context Free Grammar
- BNF grammars have a single nonterminal on the left side of a production
- BNF grammars are called context free since a
substitution can always be made (ie it does not matter what the
context of the nonterminal is)
- So, what is the grammar for a^n b^n c^n?
S --> aSBC | aBC
Order of Substitution
- Notice: In a derivation, there is often a choice of which
nonterminal to substitute next.
- Does order of substitution matter?
- The order of substitution (choose one: DOES/DOES NOT)
change whether a given string can be derived by a grammar.
- A parse tree represents the substitutions of a derivation.
It does not show the ORDER of substitution of a
derivation.
- A given parse tree may correspond to several different derivations.
- As a practical matter,
a compiler is designed to follow a certain order.
Leftmost and Rightmost Derivations
- Having a standard order to follow simplifies comparing derivations.
- Two standard orders:
- Leftmost derivation: At each step, replace leftmost
nonterminal.
- Rightmost derivation: At each step, replace rightmost
nonterminal.
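The two standard orders can be read off a single parse tree. The following sketch assumes a parse tree stored as nested tuples of the form (nonterminal, child, ...), with terminals as plain strings; that representation and the function names are assumptions for illustration.

```python
# Parse tree for "the cat saw the dog" (nested-tuple layout is an assumption).
TREE = (
    "<sentence>",
    ("<nounPhrase>", ("<article>", "the"), ("<noun>", "cat")),
    ("<verb>", "saw"),
    ("<nounPhrase>", ("<article>", "the"), ("<noun>", "dog")),
)


def label(node):
    """Return the symbol shown for a node: nonterminal name or terminal word."""
    return node[0] if isinstance(node, tuple) else node


def derivation(tree, leftmost=True):
    """List the sentential forms of the leftmost or rightmost derivation."""
    frontier = [tree]
    steps = [" ".join(label(n) for n in frontier)]
    while any(isinstance(n, tuple) for n in frontier):
        pending = [i for i, n in enumerate(frontier) if isinstance(n, tuple)]
        i = pending[0] if leftmost else pending[-1]
        frontier[i:i + 1] = frontier[i][1:]  # replace the node by its children
        steps.append(" ".join(label(n) for n in frontier))
    return steps


if __name__ == "__main__":
    for step in derivation(TREE, leftmost=True):
        print("L ==>", step)
    for step in derivation(TREE, leftmost=False):
        print("R ==>", step)
```

Both derivations have the same number of steps and end in the same sentence; only the order of the intermediate forms differs.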
Ambiguous Grammars
- While order of substitution does NOT determine whether a given
string can be generated by a grammar, order MAY determine the structure
of the parse tree for the string.
- A grammar is ambiguous if it generates a string that has two
different parse trees.
- Example:
A --> aA | Aa | a
- This grammar is ambiguous because it generates a string (aa, among others) which has two different parse trees.
- We try to avoid ambiguous grammars
- Equivalent definitions: A grammar is ambiguous if ...
- it generates a string that has two different leftmost derivations.
- it generates a string that has two different rightmost derivations.
- Bogus definition: A grammar is ambiguous if it generates a string that has two different derivations.
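The ambiguity of A --> aA | Aa | a can be observed by enumerating parse trees directly. This is a sketch written for this one grammar only; the nested-tuple tree layout and the function name parse_trees are assumptions.

```python
def parse_trees(text):
    """Return every parse tree of text under A --> aA | Aa | a."""
    if text == "" or set(text) != {"a"}:
        return []  # the grammar only generates non-empty strings of a's
    if text == "a":
        return [("A", "a")]  # A --> a
    trees = []
    # A --> aA: the first a is a leaf, the rest comes from the inner A.
    trees += [("A", "a", sub) for sub in parse_trees(text[1:])]
    # A --> Aa: the last a is a leaf, the rest comes from the inner A.
    trees += [("A", sub, "a") for sub in parse_trees(text[:-1])]
    return trees


if __name__ == "__main__":
    for tree in parse_trees("aa"):
        print(tree)
    print(len(parse_trees("aaa")))
```

The string aa yields two distinct trees, one growing to the right and one growing to the left, which is what makes the grammar ambiguous.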
A Grammar for Expressions
- How to make a grammar that generates arithmetic expressions {eg a, a+a, a*a, a-a+a, a+a/a, ...}
- Simple grammar
- What problems do you see?
Issues with Simple Expression Grammar
- This grammar works, but has problems
- Ambiguity: Two (or more) parse trees means two (or more) meanings
- Two (or more) parse trees for an expression give two (or more) meanings (ie values) to the expression
- Precedence:
- What is meaning of a+b*c?
- Associativity:
- What is meaning of a+a+a?
- When does it matter?
- Discussion of these issues is easier with a simpler representation:
- Solution: Abstract syntax tree (AST, aka Abstract Parse Tree)
Abstract and Concrete Parse Trees
- A Concrete parse tree shows every substitution of a derivation
- This can be more information than needed
- An abstract parse tree removes redundant information:
- Removes nonterminals
- Moves operators up into interior nodes
- Leaves terminals on leaves
- Example: Tree for a+b*c
- Abstract: Shows structure, ignores derivation steps
- Pros and Cons:
- Easier to use
- Shows essentials of structure
- Does not show derivation steps
- Compilers normally use ASTs, not concrete ones
Expressions and Tree Depth
- We don't want ambiguous grammars: we have a desired structure
- We will design grammars to give the desired structure
- The desired structure gives desired precedence and associativity
- Operations must, of course, have their operands before they can be executed
- When an expression is compiled, the code that is generated will
execute the operations in the tree from the bottom up.
- Example: Code for a+b*c:
mul b, c, $t1
add a, $t1, $t2
- We say that the tree is evaluated from the bottom up
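Bottom-up evaluation corresponds to a post-order walk of the abstract tree. The sketch below assumes ASTs stored as (operator, left, right) tuples and assumes the opcode names sub and div alongside the mul and add used in the example; the function name generate is hypothetical.

```python
# Opcode names: mul and add come from the example; sub and div are assumptions.
OPCODES = {"+": "add", "-": "sub", "*": "mul", "/": "div"}


def generate(ast):
    """Return three-address instructions for an AST of (op, left, right) tuples."""
    code = []

    def visit(node):
        if isinstance(node, str):
            return node  # a leaf is already an operand name
        op, left, right = node
        left_place = visit(left)     # operands are generated first,
        right_place = visit(right)   # so deeper operators come out earlier
        temp = f"$t{len(code) + 1}"
        code.append(f"{OPCODES[op]} {left_place}, {right_place}, {temp}")
        return temp

    visit(ast)
    return code


if __name__ == "__main__":
    for line in generate(("+", "a", ("*", "b", "c"))):
        print(line)
```

Because the multiplication sits lower in the tree than the addition, its instruction is emitted first.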
Expression Precedence
- Precedence: In an expression that has operators of different
precedence, the operators with higher precedence must be lower
in the tree.
- Example: Abstract tree for a + b * c + d * e
- How to make operations be higher or lower:
- Introduce new variables that force deriving low precedence operations before high
precedence ones (ie make them closer to start symbol)
Precedence Grammar
- Introduce two variables to derive all plus and minus before all times and divide
- E → ...
- Example 1: a + b * c
- Example 2: a + b * c * d + e * f + g
- Buzz words: Expressions made up of Terms, which are made up
of Factors
- Questions:
- What associativity do we have?
- How do we handle parentheses?
Expression Associativity
- In an expression that has more than 1 operator at the same
precedence level:
- Left associative operations are done left to right
- Right associative operations are done right to left
- Most operators are left associative
- Example: a + b + c
- How to accomplish left associativity:
- Left plus operator must be lower in the tree than the right
- This is accomplished by making the production that generates the + a left recursive production
- Example:
E → E + T
- Builds tree to left
- How to accomplish right associativity:
- Later we will see another way to handle associativity
- Because of problems with left recursive grammars
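The effect of associativity on tree shape and on the computed value can be seen with a small experiment. This sketch is illustrative only: it builds both trees by hand for a list of operands instead of deriving them from a grammar, and the function names are assumptions.

```python
from functools import reduce


def left_assoc(op, operands):
    """Build a tree in which the leftmost operator is lowest."""
    return reduce(lambda tree, operand: (op, tree, operand), operands)


def right_assoc(op, operands):
    """Build a tree in which the rightmost operator is lowest."""
    if len(operands) == 1:
        return operands[0]
    return (op, operands[0], right_assoc(op, operands[1:]))


def evaluate(tree):
    """Evaluate a tree of (op, left, right) tuples with + and - only."""
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree
    a, b = evaluate(left), evaluate(right)
    return a - b if op == "-" else a + b


if __name__ == "__main__":
    print(left_assoc("-", [8, 4, 2]), evaluate(left_assoc("-", [8, 4, 2])))
    print(right_assoc("-", [8, 4, 2]), evaluate(right_assoc("-", [8, 4, 2])))
```

For subtraction the two shapes give different values, while for addition of integers they agree, so the choice matters only for some operators.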
Expressions with Parentheses
- When are parenthesized expressions done?
- Where do parenthesized expressions need to be in the tree?
- How do we accomplish this?
- Expression grammar:
E --> E + T | E - T | T
T --> T * F | T / F | F
F --> ???
F --> a | b | c
Grammar for Identifiers and Integers
<integer> --> <digit><integer> | <digit>
<digit> --> 0 | 1 | 2 | 3 ... | 9
<ident> --> <letter>|<letter><letterOrDigitSeq>
<letterOrDigitSeq> --> <letterOrDigit>|<letterOrDigit><letterOrDigitSeq>
<letterOrDigit> --> <letter> | <digit>
<letter> --> a | b | c ...
- Equivalent to finite state machine and code seen earlier
- Implemented by scanner
- Restrictions on length are not expressed in the grammar
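A scanner can check these two rules with simple loops, since the recursion in the grammar only expresses repetition. The sketch below assumes that letters are the lowercase ASCII letters (the grammar only lists a | b | c ...); the function names are hypothetical.

```python
import string

LETTERS = set(string.ascii_lowercase)  # assumption: <letter> is a lowercase letter
DIGITS = set(string.digits)
LETTERS_OR_DIGITS = LETTERS | DIGITS


def is_integer(text):
    """<integer>: one or more digits."""
    return len(text) > 0 and all(ch in DIGITS for ch in text)


def is_ident(text):
    """<ident>: a letter followed by zero or more letters or digits."""
    return (
        len(text) > 0
        and text[0] in LETTERS
        and all(ch in LETTERS_OR_DIGITS for ch in text[1:])
    )


if __name__ == "__main__":
    for lexeme in ["count1", "42", "1abc", ""]:
        print(repr(lexeme), is_ident(lexeme), is_integer(lexeme))
```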
Partial Grammar for Programs
- The following grammar will parse (very) simple programs
<program> --> program <ident> is <stmtList> end;
<stmtList> --> <stmt> | <stmt> <stmtList>
<stmt> --> <ident> := <expr> ;
| if <boolExp> then <stmtList> end if ;
| if <boolExp> then <stmtList> else <stmtList> end if ;
| begin <stmt> end ;
| ...
<expr> --> <expr> + <expr>
| ...
| <ident>
| <integer>
| ( <expr> )
<boolExp> --> ...
<ident> --> ...
White space is handled by the scanner
Do you notice anything about the grammar for if statements?
- How is this parsed:
if .. then if .. then .. else ..
Another Program Grammar
- The following grammar generates almost the same
statements:
STMT --> MATCHED | UNMATCHED
MATCHED --> if BOOL then MATCHED else MATCHED
| ASSIGN
| begin STMT end
| ...
UNMATCHED --> if BOOL then STMT
| if BOOL then MATCHED else UNMATCHED
ASSIGN --> IDENT := EXPR ;
...
Every then generated by MATCHED either has an associated else or is in a begin/end block.
Thens without elses can only be generated from UNMATCHED
- There is only one way to generate if .. then if .. then .. else ..
- UNMATCHED --> if .. then [if .. then .. else ..]
- if .. then [if .. then ..] else .. cannot be parsed since no production has then UNMATCHED else
- if .. then [begin if .. then .. end] else .. can be parsed since MATCHED --> begin if .. then .. end
- The grammar is unambiguous
Ambiguous If - Other Solutions
- The grammar above solved the ambiguous if problem by forcing
an elseless if-then to either be nested in an else
or to be within a begin/end
- Another solution is to require brackets on if statements:
- Ada brackets with end if
- Perl brackets by requiring {...} on all if statements
- Whether or not brackets are required affects how else if
statements are defined
- Languages without required brackets (eg Java, C) use else if
- else if has a single statement within the else
- Languages with brackets use elsif (eg Ada, Perl)
- If else if were used, then multiple end ifs would be
required for multiple else ifs
- Example: if else if else if else if ... end if; end if; end if; end if;
- Parser can also handle ambiguity by coding a special case
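A common form of the special case is to let the parser attach an else to the nearest unmatched then. The sketch below is a simplification under stated assumptions: the input is already a list of tokens, every condition and every simple statement is a single token, and the function name parse_statement is hypothetical.

```python
def parse_statement(tokens, pos=0):
    """Parse one statement starting at pos; return (tree, next position)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] != "if":
        return tokens[pos], pos + 1  # any other token is a simple statement
    if pos + 2 >= len(tokens) or tokens[pos + 2] != "then":
        raise SyntaxError("expected a condition followed by then")
    condition = tokens[pos + 1]
    then_part, pos = parse_statement(tokens, pos + 3)
    else_part = None
    # Special case: an else seen here is taken by the innermost open if.
    if pos < len(tokens) and tokens[pos] == "else":
        else_part, pos = parse_statement(tokens, pos + 1)
    return ("if", condition, then_part, else_part), pos


if __name__ == "__main__":
    tokens = "if c1 then if c2 then s1 else s2".split()
    tree, _ = parse_statement(tokens)
    print(tree)
```

The inner call consumes the else before the outer call gets a chance to look for one, so the else always pairs with the closest if.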
EBNF: Extended BNF
- Extended BNF (EBNF) is a shorthand for BNF:
- Easier to write
- No additional expressive power
- EBNF Notation:
- [] means optional
- (|) means alternatives
- { } means 0 or more repetitions
Expression Grammar in EBNF
EBNF Grammar for If Statements
- The following EBNF grammar for if statements illustrates an
optional clause
IFSTMT -->
if BOOLEXPR then
STMTLIST
{ elsif BOOLEXPR
STMTLIST }
[ else
STMTLIST ]
end if;
Optional can be thought of as 0 or 1 occurrence
Issues with EBNF Grammar
- Not obvious what tree is represented
- Where is the recursion?
- 0 or more notation hides recursion in grammar
- Does the tree go right or left?
- No longer clear what a derivation is
- Does grammar show associativity? What about precedence?
- Associativity is implemented in the parser
- EBNF description of language commonly used when building a
parser
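When a parser is built from an EBNF description, each { } repetition typically becomes a loop, and the way the loop builds the tree decides the associativity. The sketch below assumes an EBNF expression grammar of the form E --> T { (+|-) T }, T --> F { (*|/) F }, F --> a | b | c ...; this grammar is an assumption of the example, operands are single letters, and parentheses are left out.

```python
def parse_expression(text):
    """Parse text with the assumed EBNF grammar; return an (op, left, right) tree."""
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def factor():  # F --> a | b | c ...
        token = peek()
        if token is None or not token.isalpha():
            raise SyntaxError(f"expected an operand at position {pos}")
        return advance()

    def term():  # T --> F { (*|/) F }
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())  # earlier result becomes the left child
        return node

    def expr():  # E --> T { (+|-) T }
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r} at position {pos}")
    return tree


if __name__ == "__main__":
    print(parse_expression("a + b * c"))
    print(parse_expression("a - b - c"))
```

Each loop folds the result so far into the left operand, so the tree grows to the left and the operators come out left associative even though the EBNF rule itself shows no left recursion.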