ITEC 380 - Chapter 3 - Syntax
Lexical Structure of Programming Languages
Aspects of Studying Programming Languages
- There are three aspects of studying programming languages:
- Syntax: A description of
- the well-formed sentences (ie programs) of the language
- the structure of (well-formed) sentences
- Semantics: A description of
- the meaning of the parts (ie statements) of the language
- the meaning of groups of statements (ie programs).
- Pragmatics: Everything else (eg quality, cost)
How to Specify Syntax
- We specify syntax with a mixture of
- English
- Grammars
- Regular expressions
- Other techniques
- Our focus is grammars and regular expressions
Program Syntax - Two Areas
- We break study of syntax into two areas:
- Lexical analysis: Deals with defining and finding the words of a language
- Syntax analysis: Deals with defining and finding the structure of groups
of words in a program
Lexical Analysis: Defining and Finding Tokens
- Lexical analysis has 2 goals:
- Goal 1: Define structure of the tokens of the language
- A token is a word in the language
- Goal 2: Find the tokens of the language in a program
- Break a list of characters into a list of tokens
- First phase of translator (ie scanner)
- Example: Find the tokens in:
sum := v1 * v2 ** 10;
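One way to see what a scanner produces for the example statement is a small regex-driven tokenizer. This is a sketch only; the token names (IDENT, ASSIGN, POWER, etc.) and the `tokenize` function are illustrative, not from any real tool.

```python
import re

# Token categories and their patterns; POWER (**) is listed before
# TIMES (*) so the longer symbol is matched first.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("IDENT",  r"[a-zA-Z][a-zA-Z0-9]*"),
    ("ASSIGN", r":="),
    ("POWER",  r"\*\*"),
    ("TIMES",  r"\*"),
    ("SEMI",   r";"),
    ("SKIP",   r"\s+"),            # white space is a delimiter, not a token
]
pattern = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    """Break a list of characters into a list of (token, lexeme) pairs."""
    return [(m.lastgroup, m.group())
            for m in pattern.finditer(text)
            if m.lastgroup != "SKIP"]

print(tokenize("sum := v1 * v2 ** 10;"))
# [('IDENT', 'sum'), ('ASSIGN', ':='), ('IDENT', 'v1'), ('TIMES', '*'),
#  ('IDENT', 'v2'), ('POWER', '**'), ('NUMBER', '10'), ('SEMI', ';')]
```

Note that each lexeme (eg `sum`, `:=`) is paired with its token category (eg IDENT, ASSIGN), previewing the tokens-vs-lexemes distinction below.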
Categories of Tokens
- Reserved words (aka keywords) (eg if, else)
- Literals, constants (eg 42, "Hello", true)
- Issue: how flexible is a new language in allowing
definition of new literals
- Special symbols (eg :, :=, <, <=)
- Issue: how flexible is a new language in allowing
definition of new meaning for symbols
- Identifiers (eg myName, theState, toString)
- Issue: Are some identifiers predefined?
- Issue: max length, significant length?
Token Identification
- What are some issues with finding tokens?
- Principle of longest substring
- What could happen with a variable called fori?
- Must define token delimiters (eg special symbols, white space)
- Is the language free or fixed format?
- Most modern languages are free format
- Exceptions (modern and older):
- ABC, Python
- FORTRAN: Token structure and syntax are intertwined
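The fori question above is answered by the principle of longest substring (maximal munch): the scanner takes the longest match, then checks whether it is a reserved word. A sketch, assuming Python's re module; the keyword set and `classify` function are illustrative.

```python
import re

KEYWORDS = {"for", "if", "else"}          # a tiny illustrative keyword set
ident = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")

def classify(text):
    # re.match is greedy: it grabs the longest identifier prefix,
    # so 'fori' is consumed whole rather than as 'for' + 'i'.
    lexeme = ident.match(text).group()
    kind = "KEYWORD" if lexeme in KEYWORDS else "IDENT"
    return kind, lexeme

print(classify("fori"))   # ('IDENT', 'fori') - NOT keyword 'for' plus 'i'
print(classify("for"))    # ('KEYWORD', 'for')
```

So `fori` is a perfectly legal identifier in a free-format language; only a delimiter (eg white space) would separate `for` from `i`.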
Token Specification: Regular Expressions
- Regular expressions are patterns that are used to give a
precise, concise description of tokens
- Example - Numbers: [0-9]+
- Example - Identifiers: L(L|D)*
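The two example patterns can be tried directly, assuming Python's re syntax; the shorthands L and D are expanded by hand as in the notation below.

```python
import re

number = re.compile(r"[0-9]+")                       # one or more digits
identifier = re.compile(r"[a-zA-Z]([a-zA-Z]|[0-9])*")  # L(L|D)*

print(bool(number.fullmatch("42")))         # True
print(bool(identifier.fullmatch("v2")))     # True
print(bool(identifier.fullmatch("2var")))   # False - must start with a letter
```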
Regular Expression Notation
- [0-9] - one from the range 0-9
- [0-9]+ - one or more from range 0-9
- (L|D)
- L - shorthand for [a-zA-Z]
- D - shorthand for [0-9]
- (L|D) - one letter or digit
- (L|D)* - 0 or more letters or digits
- Careful to distinguish language and metalanguage
- Example: in [0-9]+ the plus symbol means 1 or more
- Example: in 2 + 3 the plus symbol is a token that means addition
- Use different fonts if not obvious from context
Tokens and Lexemes
- Lexemes are the actual words from the program (eg myVar, :=)
- Tokens are the category of the lexeme (eg identifier, assignment)
Creating Scanners
- Lex/flex:
- Input: Regular expressions describing tokens
- Output: Scanner
- Hand created
- Finite state machine shows transition for each input symbol
- Scanner encodes actions of the finite state machine
- Example from text
- Translator repeatedly calls lex()
- INT_LIT has associated value
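A hand-created scanner in the style of the textbook's lex() can be sketched as a small finite state machine over character classes. The names `lex` and INT_LIT follow the slide; everything else (the token set, the driver loop) is illustrative.

```python
INT_LIT, IDENT, EOF = "INT_LIT", "IDENT", "EOF"

def lex(text, pos):
    """Return the next (token, value) pair and the new position."""
    # Skip white space (a token delimiter).
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if pos >= len(text):
        return (EOF, None), pos
    start = pos
    if text[pos].isdigit():            # state: scanning an integer literal
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        # INT_LIT carries an associated value: the integer itself.
        return (INT_LIT, int(text[start:pos])), pos
    if text[pos].isalpha():            # state: scanning an identifier
        while pos < len(text) and text[pos].isalnum():
            pos += 1
        return (IDENT, text[start:pos]), pos
    return ("SYMBOL", text[pos]), pos + 1   # single special symbol

# The translator repeatedly calls lex() until EOF:
tokens, pos = [], 0
while True:
    tok, pos = lex("sum = 47", pos)
    tokens.append(tok)
    if tok[0] == EOF:
        break
print(tokens)
# [('IDENT', 'sum'), ('SYMBOL', '='), ('INT_LIT', 47), ('EOF', None)]
```

Each call encodes one step of the finite state machine: pick a start state from the first character's class, then loop until a character outside that class ends the token.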