Recall: We write “→” for entire grammar rules, and “⇒” for individual derivation steps.
Examples:
Some things that arose:
It is helpful to actually divide reading-a-program's-input into three phases:
tokenizing (or "lexing"): a quick pass to chunk/group characters into higher-level tokens, like identifiers, int-literals, string-literals, etc.
While it is conceivable to omit this step (having grammar non-terminals of, say, “Digit” instead of “IntLiteral”), it would mean that the resulting parse tree has an entire node for each individual digit of a literal.
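As an illustrative sketch (the token names and categories here are mine, not any official spec), a tokenizer can just be a few regular expressions tried at each position of the input:

```python
import re

# One pattern per token category, combined into one master regex.
# The categories (INT, IDENT, OP) are illustrative, not an official token set.
TOKEN_SPEC = [
    ("INT",   r"[0-9]+"),                 # int-literals
    ("IDENT", r"[A-Za-z][A-Za-z0-9]*"),   # identifiers
    ("OP",    r"[+*()=]"),                # single-character operators
    ("SKIP",  r"\s+"),                    # whitespace: matched but discarded
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(text):
    """Chunk the input string into (category, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if not m:
            raise ValueError("unexpected character %r at %d" % (text[pos], pos))
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

For example, tokenize("x1 = 734 + y") produces [("IDENT","x1"), ("OP","="), ("INT","734"), ("OP","+"), ("IDENT","y")], so later phases see "734" as one IntLiteral token rather than three separate Digits.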
In Programming Languages and Interpretation §1.4, Shriram Krishnamurthi argues that what I here call “parsing” can benefit from being decomposed into two smaller phases: reading the tokenized input into a simple hierarchy (perhaps just based on matched parentheses), at which point fully parsing/validating the grammar becomes much easier, since the input is already in a structure-rich form that is easy for the parser/validator to work with.
Indeed, using LISP/scheme/racket's fully-parenthesized syntax, that first read-into-a-hierarchy phase comes practically for free.
Note that the grammar for Java doesn't quite match "does the program compile": the CFG doesn't catch using-an-undefined-variable, for example. (CFG grammars cannot capture this!) This fact is usually ignored, and most everybody will say "yes there is a CFG for Java," even though the set of legal (compilable) programs is not, strictly speaking, a context-free language.
New terms:
S → aS
S → b
A Language: a set of strings. (e.g. the set of all valid Java programs — an infinite set). What language does the above grammar define?
The language defined by a grammar: the set of all strings derivable. To show that a string is in the grammar's language, show me a derivation (possibly a left-most derivation); equivalently, show me a parse-tree.
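Following the two rules of the grammar above (S → aS, S → b) directly gives both a recognizer and a leftmost derivation for any string. A Python sketch (the function names are mine):

```python
def in_language(s):
    """Recognize the language of  S → aS | b:  zero or more a's, then one b."""
    if s == "b":
        return True                  # apply S → b
    if s.startswith("a"):
        return in_language(s[1:])    # apply S → aS, then recur on the rest
    return False

def leftmost_derivation(s):
    """Return the derivation S ⇒ ... ⇒ s as a list of sentential forms."""
    if not in_language(s):
        return None
    steps, form = ["S"], "S"
    for ch in s:
        # The lone non-terminal S is always last in the current form.
        form = form[:-1] + ("aS" if ch == "a" else "b")
        steps.append(form)
    return steps
```

Here leftmost_derivation("aab") yields ["S", "aS", "aaS", "aab"], exactly the derivation you'd write by hand.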
To specify a grammar, you officially need four things: a set of terminal symbols, a set of non-terminal symbols, a designated start (non-terminal) symbol, and a set of rules (productions).
Most commonly seen:
context-free grammars, “CFG”:
every rule's left-hand-side is a single non-terminal.
BNF: same as CFG, but use "::=" instead of →, and "|" to separate alternatives.
regular grammar: each right-hand-side has at most one non-terminal. (Equivalent to regular expressions, in expressibility: a language is generated by some regular grammar if, and only if, there is a regular expression that matches exactly the strings in the language. We'll say "the language is regular", without committing ourselves to any particular grammar or reg.exp. Note that these are regular expressions without back-references!)
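For instance, the earlier grammar S → aS | b is regular, and the regular expression a*b matches exactly its language. A quick check in Python (no back-references needed, just concatenation and star):

```python
import re

# a*b : zero or more a's followed by exactly one b,
# the same language generated by the regular grammar  S → aS | b.
A_STAR_B = re.compile(r"a*b\Z")     # \Z: the match must consume the whole string

def is_regular_match(s):
    return A_STAR_B.match(s) is not None
```

So is_regular_match("aaab") is True, while is_regular_match("ba") and is_regular_match("abb") are False.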
A grammar for integers and identifiers:
<integer> → <digit><integer> | <digit>
<digit> → 0 | 1 | 2 | 3 ... | 9
<ident> → <letter> | <letter><letterOrDigitSeq>
<letterOrDigitSeq> → <letterOrDigit> | <letterOrDigit><letterOrDigitSeq>
<letterOrDigit> → <letter> | <digit>
<letter> → a | b | c ...
For comparison: java syntax: numbers
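Each rule of this grammar transcribes almost line-for-line into a recognizer function. A sketch in Python (restricting <letter> to lowercase a–z, for brevity):

```python
DIGITS  = set("0123456789")                    # <digit>  → 0 | 1 | ... | 9
LETTERS = set("abcdefghijklmnopqrstuvwxyz")    # <letter> → a | b | c ...

def is_integer(s):
    """<integer> → <digit><integer> | <digit>"""
    if len(s) == 1:
        return s in DIGITS                          # <integer> → <digit>
    return s[:1] in DIGITS and is_integer(s[1:])    # <integer> → <digit><integer>

def is_ident(s):
    """<ident> → <letter> | <letter><letterOrDigitSeq>"""
    if s == "" or s[0] not in LETTERS:
        return False
    # <letterOrDigitSeq>: every remaining char is a <letterOrDigit>.
    return all(c in LETTERS or c in DIGITS for c in s[1:])
```

Note how the recursive call in is_integer mirrors the recursive rule <integer> → <digit><integer>.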
EBNF: Allow shorthands in BNF: For a non-terminal A, write [A] for an optional A, and {A} (or A*) for zero-or-more repetitions of A.
Example: Write a simple grammar that matches arithmetic-expressions:
3 + 7
143
3 + 7 + 99
29 + 8 * 3
We say that a grammar is ambiguous iff one string has two different (legal) parse trees. We don't like ambiguity. There are several solutions:
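One textbook solution, sketched here (not necessarily the grammar we'll settle on in class): give each precedence level its own non-terminal, so that "29 + 8 * 3" has exactly one parse tree, with * binding tighter than +. The recursive-descent functions below follow that stratified grammar, written in the EBNF shorthand:

```python
import re

# Stratified (unambiguous) grammar, one non-terminal per precedence level:
#   Expr   → Term ('+' Term)*      lowest precedence
#   Term   → Factor ('*' Factor)*  binds tighter
#   Factor → IntLiteral
# Each nested function implements the rule in its comment. Input is assumed
# well-formed (no error handling, for brevity); evaluating as we go means the
# call structure *is* the parse tree.

def parse(text):
    toks = re.findall(r"\d+|[+*]", text)
    pos = 0
    def peek():
        return toks[pos] if pos < len(toks) else None
    def factor():                 # Factor → IntLiteral
        nonlocal pos
        tok = toks[pos]; pos += 1
        return int(tok)
    def term():                   # Term → Factor ('*' Factor)*
        nonlocal pos
        value = factor()
        while peek() == "*":
            pos += 1
            value *= factor()
        return value
    def expr():                   # Expr → Term ('+' Term)*
        nonlocal pos
        value = term()
        while peek() == "+":
            pos += 1
            value += term()
        return value
    return expr()
```

With this grammar, parse("29 + 8 * 3") gives 53 (the multiplication happens first), whereas an ambiguous grammar would also admit the (29 + 8) * 3 parse.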
Task: Can you make a grammar for Ada's int-literals, which allow underscores (similar to Java 7), except that: You can't start or end with an underscore, and can't have two in a row.
Can you make a grammar for XML?
1 This spec seems not to mention signs. That's because Java considers + and - to be unary operators, and not part of the int-literal itself.
2 For a truly grand view, you can think about an “überparse” function that accepts a string and a grammar. Indeed, there are (curried versions of) this: parser-generators which take in a grammar and return a parse function, e.g. Bison. See also: compiler-compilers like Yacc.
©2013, Ian Barland, Radford University. Last modified 2013.Oct.16 (Wed)
Please mail any suggestions (incl. typos, broken links) to ibarland@radford.edu