ITEC 380 - Chapter 3 - Syntax
Lexical Structure of Programming Languages
Aspects of Studying Programming Languages
- There are three aspects of studying programming languages:
- Syntax: A description of
- the well-formed sentences (ie programs) of the language
- the structure of (well-formed) sentences
- Semantics: A description of
- the meaning of the parts (ie statements) of the language
- the meaning of groups of statements (ie programs).
- Pragmatics: Everything else (eg quality, cost)
How to Specify Syntax
- We specify syntax with a mixture of
- English
- Grammars
- Regular expressions
- Other techniques
- Our focus is grammars and regular expressions
Program Syntax - Two Areas
- We break study of syntax into two areas:
- Lexical analysis: Deals with defining and finding the words of a language
- Syntax analysis: Deals with defining and finding the structure of groups
of words in a program
Lexical Analysis: Defining and Finding Tokens
- Lexical analysis has 2 goals:
- Goal 1: Define structure of the tokens of the language
- A token is a word in the language
- Goal 2: Find the tokens of the language in a program
- Break a list of characters into a list of tokens
- First phase of translator (ie scanner)
- Example: Find the tokens in:
sum := v1 * v2 ** 10;
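One way to see what a scanner produces for the example statement is a small regex-driven tokenizer. This is a sketch only; the token names (IDENT, ASSIGN, POWER, etc.) and the `tokenize` function are illustrative, not from any real tool.

```python
import re

# Token categories and their patterns; POWER (**) is listed before
# TIMES (*) so the longer symbol is matched first.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("IDENT",  r"[a-zA-Z][a-zA-Z0-9]*"),
    ("ASSIGN", r":="),
    ("POWER",  r"\*\*"),
    ("TIMES",  r"\*"),
    ("SEMI",   r";"),
    ("SKIP",   r"\s+"),            # white space is a delimiter, not a token
]
pattern = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    """Break a list of characters into a list of (token, lexeme) pairs."""
    return [(m.lastgroup, m.group())
            for m in pattern.finditer(text)
            if m.lastgroup != "SKIP"]

print(tokenize("sum := v1 * v2 ** 10;"))
# [('IDENT', 'sum'), ('ASSIGN', ':='), ('IDENT', 'v1'), ('TIMES', '*'),
#  ('IDENT', 'v2'), ('POWER', '**'), ('NUMBER', '10'), ('SEMI', ';')]
```

Note that each lexeme (eg `sum`, `:=`) is paired with its token category (eg IDENT, ASSIGN), previewing the tokens-vs-lexemes distinction below.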
Categories of Tokens
- Reserved words (aka keywords) (eg if, else)
- Literals, constants (eg 42, "Hello", true)
- Issue: how flexible is a new language in allowing
definition of new literals
- Special symbols (eg :, :=, <, <=)
- Issue: how flexible is a new language in allowing
definition of new meaning for symbols
- Identifiers (eg myName, theState, toString)
- Issue: Are some identifiers predefined?
- Issue: max length, significant length?
Token Identification
- What are some issues with finding tokens?
- Principle of longest substring
- What could happen with a variable called fori?
- Must define token delimiters (eg special symbols, white space)
- Is the language free or fixed format?
- Most modern languages are free format
- Exceptions (modern and older):
- ABC, Python
- FORTRAN: Token structure and syntax are intertwined
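The fori question above is answered by the principle of longest substring (maximal munch): the scanner takes the longest match, then checks whether it is a reserved word. A sketch, assuming Python's re module; the keyword set and `classify` function are illustrative.

```python
import re

KEYWORDS = {"for", "if", "else"}          # a tiny illustrative keyword set
ident = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")

def classify(text):
    # re.match is greedy: it grabs the longest identifier prefix,
    # so 'fori' is consumed whole rather than as 'for' + 'i'.
    lexeme = ident.match(text).group()
    kind = "KEYWORD" if lexeme in KEYWORDS else "IDENT"
    return kind, lexeme

print(classify("fori"))   # ('IDENT', 'fori') - NOT keyword 'for' plus 'i'
print(classify("for"))    # ('KEYWORD', 'for')
```

So `fori` is a perfectly legal identifier in a free-format language; only a delimiter (eg white space) would separate `for` from `i`.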
Token Specification: Regular Expressions
- Regular expressions are patterns that are used to give a
precise, concise description of tokens
- Example - Numbers: [0-9]+
- Example - Identifiers: L(L|D)*
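The two example patterns can be tried directly, assuming Python's re syntax; the shorthands L and D are expanded by hand as in the notation below.

```python
import re

number = re.compile(r"[0-9]+")                       # one or more digits
identifier = re.compile(r"[a-zA-Z]([a-zA-Z]|[0-9])*")  # L(L|D)*

print(bool(number.fullmatch("42")))         # True
print(bool(identifier.fullmatch("v2")))     # True
print(bool(identifier.fullmatch("2var")))   # False - must start with a letter
```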
Regular Expression Notation
- [0-9] - one from the range 0-9
- [0-9]+ - one or more from range 0-9
- (L|D)
- L - shorthand for [a-zA-Z]
- D - shorthand for [0-9]
- (L|D) - one letter or digit
- (L|D)* - 0 or more letters or digits
- Careful to distinguish language and metalanguage
- Example: in [0-9]+ the plus symbol means 1 or more
- Example: in 2 + 3 the plus symbol is a token that means addition
- Use different fonts if not obvious from context
Tokens and Lexemes
- Lexemes are the actual words from the program (eg myVar, :=)
- Tokens are the category of the lexeme (eg identifier, assignment)
Creating Scanners
- Lex/flex:
- Input: Regular expressions describing tokens
- Output: Scanner
- Hand created
- Finite state machine shows transition for each input symbol
- Scanner encodes actions of the finite state machine
- Example from text
- Translator repeatedly calls lex()
- INT_LIT has associated value
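A hand-created scanner in the style of the textbook's lex() can be sketched as a small finite state machine over character classes. The names `lex` and INT_LIT follow the slide; everything else (the token set, the driver loop) is illustrative.

```python
INT_LIT, IDENT, EOF = "INT_LIT", "IDENT", "EOF"

def lex(text, pos):
    """Return the next (token, value) pair and the new position."""
    # Skip white space (a token delimiter).
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if pos >= len(text):
        return (EOF, None), pos
    start = pos
    if text[pos].isdigit():            # state: scanning an integer literal
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        # INT_LIT carries an associated value: the integer itself.
        return (INT_LIT, int(text[start:pos])), pos
    if text[pos].isalpha():            # state: scanning an identifier
        while pos < len(text) and text[pos].isalnum():
            pos += 1
        return (IDENT, text[start:pos]), pos
    return ("SYMBOL", text[pos]), pos + 1   # single special symbol

# The translator repeatedly calls lex() until EOF:
tokens, pos = [], 0
while True:
    tok, pos = lex("sum = 47", pos)
    tokens.append(tok)
    if tok[0] == EOF:
        break
print(tokens)
# [('IDENT', 'sum'), ('SYMBOL', '='), ('INT_LIT', 47), ('EOF', None)]
```

Each call encodes one step of the finite state machine: pick a start state from the first character's class, then loop until a character outside that class ends the token.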