Recall: We write “→” for entire grammar rules, and “⇒” for individual derivation steps.
Examples:
Some things that arose:
It is helpful to actually divide reading-a-program's-input into three phases:
tokenizing (or "lexing"): a quick pass to chunk/group characters into higher-level tokens, like identifiers, int-literals, string-literals, etc.
While it is conceivable to omit this step (having grammar non-terminals of, say, “Digit” instead of “IntLiteral”), it would mean that the resulting parse tree has an entire node for each individual digit of a literal.
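As an illustrative sketch (the token names and categories here are mine, not any official spec), a tokenizer can just be a few regular expressions tried at each position of the input:

```python
import re

# One pattern per token category, combined into one master regex.
# The categories (INT, IDENT, OP) are illustrative, not an official token set.
TOKEN_SPEC = [
    ("INT",   r"[0-9]+"),                 # int-literals
    ("IDENT", r"[A-Za-z][A-Za-z0-9]*"),   # identifiers
    ("OP",    r"[+*()=]"),                # single-character operators
    ("SKIP",  r"\s+"),                    # whitespace: matched but discarded
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(text):
    """Chunk the input string into (category, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if not m:
            raise ValueError("unexpected character %r at %d" % (text[pos], pos))
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

For example, tokenize("x1 = 734 + y") produces [("IDENT","x1"), ("OP","="), ("INT","734"), ("OP","+"), ("IDENT","y")], so later phases see "734" as one IntLiteral token rather than three separate Digits.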
In Programming Languages and Interpretation §1.4, Shriram Krishnamurthi argues that what I here call “parsing” can benefit from being decomposed into two smaller phases: reading the tokenized input into a simple hierarchy (perhaps just based on matched parentheses), at which point fully parsing/validating the grammar becomes much easier, since the input is already in a structure-rich form that is easy for the parser/validator to work with.
Indeed, using LISP/scheme/racket's fully-parenthesized syntax, that first read-into-a-hierarchy phase comes practically for free.
Note that the grammar for Java doesn't quite match "does the program compile": the CFG doesn't catch using-an-undefined-variable, for example. (CFG grammars cannot capture this!) This fact is usually ignored, and most everybody will say "yes there is a CFG for Java," even though the set of legal (compilable) programs is not, strictly speaking, a context-free language.
New terms:
S → aS
S → b
A Language: a set of strings. (e.g. the set of all valid Java programs — an infinite set). What language does the above grammar define?
The language defined by a grammar: the set of all strings derivable. To show that a string is in the grammar's language, show me a derivation (possibly a left-most derivation); equivalently, show me a parse-tree.
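Following the two rules of the grammar above (S → aS, S → b) directly gives both a recognizer and a leftmost derivation for any string. A Python sketch (the function names are mine):

```python
def in_language(s):
    """Recognize the language of  S → aS | b:  zero or more a's, then one b."""
    if s == "b":
        return True                  # apply S → b
    if s.startswith("a"):
        return in_language(s[1:])    # apply S → aS, then recur on the rest
    return False

def leftmost_derivation(s):
    """Return the derivation S ⇒ ... ⇒ s as a list of sentential forms."""
    if not in_language(s):
        return None
    steps, form = ["S"], "S"
    for ch in s:
        # The lone non-terminal S is always last in the current form.
        form = form[:-1] + ("aS" if ch == "a" else "b")
        steps.append(form)
    return steps
```

Here leftmost_derivation("aab") yields ["S", "aS", "aaS", "aab"], exactly the derivation you'd write by hand.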
To specify a grammar, you officially need four things: a set of terminal symbols, a set of non-terminal symbols, a designated start (non-terminal) symbol, and a set of rules (productions).
Most commonly seen:
context-free grammars, “CFG”:
every rule's left-hand-side is a single non-terminal.
BNF: same as CFG, but use "::=" instead of →, and "|" to separate alternatives.
regular grammar: each right-hand-side has at most one non-terminal. (Equivalent to regular expressions, in expressibility: a language is generated by some regular grammar if, and only if, there is a regular expression that matches exactly the strings in the language. We'll say "the language is regular", without committing ourselves to any particular grammar or reg.exp. Note that these are regular expressions without back-references!)
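For instance, the earlier grammar S → aS | b is regular, and the regular expression a*b matches exactly its language. A quick check in Python (no back-references needed, just concatenation and star):

```python
import re

# a*b : zero or more a's followed by exactly one b,
# the same language generated by the regular grammar  S → aS | b.
A_STAR_B = re.compile(r"a*b\Z")     # \Z: the match must consume the whole string

def is_regular_match(s):
    return A_STAR_B.match(s) is not None
```

So is_regular_match("aaab") is True, while is_regular_match("ba") and is_regular_match("abb") are False.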
A grammar for integers and identifiers:
<integer> → <digit><integer> | <digit>
<digit> → 0 | 1 | 2 | 3 ... | 9
<ident> → <letter> | <letter><letterOrDigitSeq>
<letterOrDigitSeq> → <letterOrDigit> | <letterOrDigit><letterOrDigitSeq>
<letterOrDigit> → <letter> | <digit>
<letter> → a | b | c ...
For comparison: java syntax: numbers
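Each rule of this grammar transcribes almost line-for-line into a recognizer function. A sketch in Python (restricting <letter> to lowercase a–z, for brevity):

```python
DIGITS  = set("0123456789")                    # <digit>  → 0 | 1 | ... | 9
LETTERS = set("abcdefghijklmnopqrstuvwxyz")    # <letter> → a | b | c ...

def is_integer(s):
    """<integer> → <digit><integer> | <digit>"""
    if len(s) == 1:
        return s in DIGITS                          # <integer> → <digit>
    return s[:1] in DIGITS and is_integer(s[1:])    # <integer> → <digit><integer>

def is_ident(s):
    """<ident> → <letter> | <letter><letterOrDigitSeq>"""
    if s == "" or s[0] not in LETTERS:
        return False
    # <letterOrDigitSeq>: every remaining char is a <letterOrDigit>.
    return all(c in LETTERS or c in DIGITS for c in s[1:])
```

Note how the recursive call in is_integer mirrors the recursive rule <integer> → <digit><integer>.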
EBNF: Allow shorthands in BNF: For a non-terminal A, write [A] for an optional A, and {A} (or A*) for zero-or-more repetitions of A.
Example: Write a simple grammar that matches arithmetic-expressions:
3 + 7
143
3 + 7 + 99
29 + 8 * 3
We say that a grammar is ambiguous iff one string has two different (legal) parse trees. We don't like ambiguity. There are several solutions:
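One textbook solution, sketched here (not necessarily the grammar we'll settle on in class): give each precedence level its own non-terminal, so that "29 + 8 * 3" has exactly one parse tree, with * binding tighter than +. The recursive-descent functions below follow that stratified grammar, written in the EBNF shorthand:

```python
import re

# Stratified (unambiguous) grammar, one non-terminal per precedence level:
#   Expr   → Term ('+' Term)*      lowest precedence
#   Term   → Factor ('*' Factor)*  binds tighter
#   Factor → IntLiteral
# Each nested function implements the rule in its comment. Input is assumed
# well-formed (no error handling, for brevity); evaluating as we go means the
# call structure *is* the parse tree.

def parse(text):
    toks = re.findall(r"\d+|[+*]", text)
    pos = 0
    def peek():
        return toks[pos] if pos < len(toks) else None
    def factor():                 # Factor → IntLiteral
        nonlocal pos
        tok = toks[pos]; pos += 1
        return int(tok)
    def term():                   # Term → Factor ('*' Factor)*
        nonlocal pos
        value = factor()
        while peek() == "*":
            pos += 1
            value *= factor()
        return value
    def expr():                   # Expr → Term ('+' Term)*
        nonlocal pos
        value = term()
        while peek() == "+":
            pos += 1
            value += term()
        return value
    return expr()
```

With this grammar, parse("29 + 8 * 3") gives 53 (the multiplication happens first), whereas an ambiguous grammar would also admit the (29 + 8) * 3 parse.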
Task: Can you make a grammar for Ada's int-literals, which allow underscores (similar to Java 7), except that: You can't start or end with an underscore, and can't have two in a row.
Can you make a grammar for XML?
1 This spec seems not to mention signs. That's because Java considers + and - to be unary operators, and not part of the int-literal itself.
2 For a truly grand view, you can think about an “überparse” function that accepts a string and a grammar. Indeed, there are (curried versions of) this: parser-generators which take in a grammar and return a parse function, e.g. Bison. See also: compiler-compilers like Yacc.
©2013, Ian Barland, Radford University. Last modified 2013.Oct.16 (Wed)
Please mail any suggestions (incl. typos, broken links) to ibarland@radford.edu