Chapter 6: Data Types

Overview

Importance of types
Primitive types

Definitions
Numeric
Non-numeric

Boolean

Character

Strings

Primitive?
Static?
Operations
Implementation

User defined ordinal

Enumerated
Subrange

Arrays

Definition
Arrays and bounds checking
Categories
Initialization and Operations
Implementation

Associative arrays
Records

Definition
Accessing fields
Implementation

Union Types

Definition
Free vs tagged
Implementation

Pointer Types

Definition
Operations
Problems
Languages: Ada, C, C++
Reference types

Pointer Implementation

Heap management

Evolution

Types are Important

Types allow thinking at problem level rather than machine level

Algorithms + Data Structures = Programs (a book by N. Wirth)

Programmer needs

Primitive types (eg integer)
User defined types
To be able to implement ADTs (Ch. 11)
Way of organizing data: Modules and classes (Chapter 11)

A programmer should understand the language's type system!

Chapter 6 Goals

We will study the following:

The types provided by different languages

Design issues involved with different types

Choices made by common languages

How types are implemented

Primitive Types

We study primitive and non-primitive types
The term primitive is used with several meanings:

Primitive	Non-Primitive
Book's def: Not defined in terms of other types	Defined in terms of other types
Provided by language	User defined
Scalar: has no hierarchical structure	Structured or Composite: has hierarchical structure
Variable holds value	Reference type: variable holds a pointer to a value (eg Java)

Numeric Types

Integer

Signed: 2's complement most common
Unsigned integers allowed?
Size part of language definition

Floating Point:

Floating point implementation is hard

Problems include how to do rounding and truncation
Extra digits needed in calculations when aligning exponents

IEEE 754 Floating point standard resolved many issues (but issues remain)

Standard specifies how to handle overflow, underflow, +/- infinity, NaN, rounding

Single precision: 1 bit sign, 8 bit exponent, 23 bit significand

Largest number for single precision: about 3.4 * 10^38, double precision: about 1.8 * 10^308

Smallest positive normalized number for single precision: about 1.2 * 10^-38, double precision: about 2.2 * 10^-308

Exponent uses biased notation (not 2's complement) to simplify sorting (goal: the less-than comparison for integers also works for non-negative floating point numbers)

Significand is normalized and the leading one bit is assumed (not stored)

Example: 3/8 = 1/4 + 1/8 = 0.01 + 0.001 = 0.011 (binary) = 1.1 * 2^-2
is stored with fraction bits 1000...0 and biased exponent -2 + 127 = 125

Problem: how to represent 0.0? Answer: special value of 32 zero bits

Size of exponent and significand is tradeoff between accuracy and range
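
The single precision layout can be inspected directly. A minimal sketch, assuming Python's standard struct module; the helper name float32_fields is made up for illustration:

```python
import struct

def float32_fields(x):
    """Split x, encoded as IEEE 754 single precision, into its three fields."""
    # Pack as a big-endian 32-bit float, then reinterpret the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF        # 23 bits, the leading 1 is not stored
    return sign, exponent, fraction

sign, exponent, fraction = float32_fields(0.375)   # 3/8 = 1.1 (binary) * 2^-2
print(sign, exponent - 127, format(fraction, "023b"))
# prints: 0 -2 10000000000000000000000
```

Zero comes out as all three fields equal to 0, which matches the "32 zero bits" special value.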

Fixed point: number of decimal (or binary) places of precision fixed

Good for banking
Cobol, Ada, PL/1
Implement as

string of digits or

integer with divisor: less space and faster ops

Ada allows both decimal or binary fixed point
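
The "integer with divisor" implementation can be sketched as follows. This is an illustration only: the scale of 100 and the helper names fixed_mul and fixed_to_str are assumptions, and only non-negative values are handled.

```python
SCALE = 100  # two decimal places: values are stored as a count of hundredths

def fixed_mul(a, b):
    """Multiply two scaled integers; the raw product carries SCALE twice, so divide once."""
    return (a * b) // SCALE   # integer division truncates the extra digits

def fixed_to_str(a):
    """Format a non-negative scaled integer as a decimal string."""
    return f"{a // SCALE}.{a % SCALE:02d}"

# Binary floating point cannot represent 0.1 exactly, so the sum drifts.
print(0.1 + 0.2 == 0.3)                   # False
# The same amounts as hundredths are plain integers, so addition is exact.
print(10 + 20 == 30)                      # True
print(fixed_to_str(fixed_mul(1999, 7)))   # 19.99 * 0.07 truncated: 1.39
```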

Rational:

Separate numerator and denominator
Exact arithmetic: (1.0 / 3.0) * 3.0 = 1.0 ??
LISP ratios
Ada compile time constants: pi: constant := 3.14... to 50 places
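
Exact rational arithmetic can be tried with Python's standard fractions module, which stores a separate numerator and denominator. A small sketch:

```python
from fractions import Fraction

third = Fraction(1, 3)        # numerator 1, denominator 3
print(third * 3 == 1)         # True: no rounding occurs

# Summing one tenth ten times: exact with rationals, inexact with binary floats.
print(sum([Fraction(1, 10)] * 10) == 1)   # True
print(sum([0.1] * 10) == 1.0)             # False
```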

Non-Numeric Primitive: Boolean

Issue: Separate type or designate values as true/false

Java, Ada, C# have a separate type: gives better type checking

C, C++, Perl, Python, ... designate numbers, strings, pointers, ... as true/false

Expressive, but type checking catches fewer errors

Not normally stored as a bit - Why?

Issue: Built-in type or standard type

Introduced by Algol 60

Non-Numeric Primitive: Characters

Three common sizes:

ASCII (7 bit) or ISO 8859-1 (8 bit)
Unicode (16 bit)

First 128 characters are identical to ASCII
Java character
Ada Wide_Character
Python, Perl, C# and Javascript use Unicode

Wide Unicode (32 bit)

Ada Wide_Wide_Character

Issue for 16 and 32 bit: big-endian or little endian (ie are bytes ordered high to low or low to high addresses)

Usually a primitive type (ie Built-in type)
- Python: String of length 1
- Ada: Enumerated type in package Standard

Strings

Issue: Is a string

an array of characters (C, C++, Ada)
or a special type (Java, C#, Javascript, PPR (Perl, Python, Ruby))

Issue: String Length

Static (ie Fixed): Ada, Java, Python, C# .NET, C++ Standard library

May be immutable, as in Java

Limited Dynamic (ie Bounded): Ada, C, C++
Dynamic (ie Variable) : Ada, C++ SL, Javascript, Perl

Implementation

Static:

Header record or object containing length and address or array

Limited Dynamic:

Array of characters with end marker (C: 0 byte)
Header record with maximum length, current length, and address or array

Dynamic:

Linked list
Dynamic allocation:

Reallocate when size changes
Amortized (ie header with current length, allocate block, reallocate when block filled)
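
The amortized scheme can be sketched as below. The class name DynamicString, the starting capacity of 4, and the doubling policy are all assumptions chosen for illustration, not a description of any particular language's runtime.

```python
class DynamicString:
    """Header (length, capacity) plus a block that is reallocated only when full."""

    def __init__(self):
        self.capacity = 4
        self.length = 0
        self.block = bytearray(self.capacity)
        self.reallocations = 0

    def append(self, data):
        needed = self.length + len(data)
        if needed > self.capacity:
            # Block is full: allocate a larger one (doubling) and copy the old contents.
            new_capacity = self.capacity
            while new_capacity < needed:
                new_capacity *= 2
            new_block = bytearray(new_capacity)
            new_block[:self.length] = self.block[:self.length]
            self.block, self.capacity = new_block, new_capacity
            self.reallocations += 1
        self.block[self.length:needed] = data
        self.length = needed

    def value(self):
        return bytes(self.block[:self.length])

s = DynamicString()
s.append(b"Hi")
s.append(b" Mom")
print(s.value(), s.length, s.capacity)   # b'Hi Mom' 6 8
```

With doubling, appending 100 single characters triggers only 5 reallocations instead of one per append.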

Evaluation:

Static: faster, but not flexible
Dynamic: slower, but flexible
Limited Dynamic: compromise between speed and flexibility
Security: 0-terminated is dangerous
Reference or value semantics

Where allocated

Static: C string literals
Stack: Ada fixed and bounded length
Heap: Java strings, variable length strings

Operations

Allocation

Declaration: implicit or explicit:
Ada: foo: String(1..10); foo2: String := "Hi Mom";
Java: String s = "abc"; String t = new String("ABC");
C: char *s = "abc"; // literal occupies 4 bytes (3 characters plus the 0 byte)
C: char *s = malloc(10); // allocates room for a 9 char string plus the 0 byte

Index, slice/substring
Substitution/changing length (Java Strings are immutable)
Pattern matching

User-Defined Ordinal Types

Ordinal Type: values can be ordered (ie has operations first, last, next [at least implicitly])

Enumerated types
Subrange types

Enumerated types:

Define a new type with a new set of values represented by identifiers
Ada, C#, introduced by Pascal, added to Java 1.5
Operations

declare
assignment
relational
array indices
loop index over subrange

Issue: Strongly typed? (ie no arithmetic such as blue + 1)

Ada, Java, C#: Strongly typed
C: int i = blue; mycolor = blue + 1; // all okay
C++: int i = blue; // okay
C++: mycolor = 111; // not okay
C++: mycolor = (color) 111; // okay

Enhance readability and type checking

Implemented as integers

Contrast with Java (pre 1.5): public static final int RED = 3;

Trends: returning to favor, but scripting langs don't use
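
The strongly typed versus integer-like distinction can be illustrated with Python's standard enum module (a library type, not a built-in). The class names Color and CColor are made up for this sketch:

```python
from enum import Enum, IntEnum

class Color(Enum):        # behaves like a strongly typed enumeration
    RED = 1
    GREEN = 2
    BLUE = 3

class CColor(IntEnum):    # behaves like a C enum: members are also integers
    RED = 1
    GREEN = 2
    BLUE = 3

def strict_add_fails():
    """Return True if arithmetic on a Color member is rejected."""
    try:
        Color.BLUE + 1
    except TypeError:
        return True
    return False

print(strict_add_fails())        # True
print(CColor.BLUE + 1)           # 4
print([c.name for c in Color])   # ['RED', 'GREEN', 'BLUE']
```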

Subrange types

Subrange of an ordinal type
Subtype compatible with parent type
Compile time and runtime type checking
Improves reliability and readability

Introduced by Pascal
Ada is the only common language that supports them
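
The runtime half of subrange checking can be sketched as below. The Subrange class is hypothetical; a language with real subrange types would also reject some violations at compile time.

```python
class Subrange:
    """A hypothetical runtime-checked subrange low..high of the integers."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def check(self, value):
        """Return value if it lies in the subrange, otherwise raise ValueError."""
        if not (self.low <= value <= self.high):
            raise ValueError(f"{value} not in {self.low}..{self.high}")
        return value

day_of_month = Subrange(1, 31)
d = day_of_month.check(15)     # accepted; check(32) would raise ValueError
```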

Array Definition

Mapping from ordinal type (normally integer) to some other type
Homogeneous aggregate (ie all elements are of same type)
Stored contiguously

Array Subscripts

Issue: What can be a subscript?

Any ordinal type?
Integers only?
Start at 0 or any range?

How are subscripts denoted:

() - not distinguishable from a function (conceptually the same as a function)
[] - distinguishable from a function

Are subscripts checked at runtime:

Yes: Ada, Java, C#, Python (Benefit?)
No: C, C++, Perl, Fortran (What happens if out of bounds? Benefit?)
Perl checks if -w is used, otherwise not
Is a check needed in all situations?

Array Initialization

C, C++, C#, Java:

int list [] = {1, 2, 3};
compiler cannot check size
allows: int list [] = {11,12, 11, }; // Size?

Ada:

list: array (0..2) of integer := (11, 22, 33);
checks size - redundant information

list: array (0..2) of integer := (1 => 12, others => 11);
aggregate value, keyword or positional values, others

Array Operations

Assignment, equality

reference vs value semantics

Aggregate value

list := (11, 22, 33);

Mathematical operations: APL and Fortran
Slicing: Ada, Fortran

Multidimensional Array

Rectangular: Ada, C#, Fortran - store contiguous rows (or columns, see below)

Example: a: array (1..3, 1..3) of integer; a(i, j) := 0;

Array of arrays: eg Ada, Java, C, C++, C# - store column of pointers

int[][] a = new int[3][]; a[0] = new int[3]; a[1] = new int[3];
access with a[i][j]
Some languages use z(i, j) as syntactic sugar for z(i)(j)
allows jagged arrays

Array Implementation - Accessing a Cell

Assume group of adjacent memory cells

Calculation of cell address:

Assume A: array (L .. U) of Item
Assume Item'size = S (in bytes)
Assume A'address = A
Address of A[I] = A + S * (I - L) = A + S*I - S*L = (A - S*L) + S*I

Runtime vs compile time calculations

If A, S, and L are static, then compute at compile time: A - S*L
If all are dynamic, compute all at runtime
Array references are typically highly optimized
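
The address calculation and its compile time part can be sketched as follows. The function names and the sample array (bounds 10..19, 4-byte items, address 1000) are assumptions for illustration.

```python
def cell_address(base, size, lower, index):
    """Address of A[index] computed directly: A + S * (I - L)."""
    return base + size * (index - lower)

def make_fast_accessor(base, size, lower):
    """Precompute the constant part (A - S*L) once, as a compiler could when A, S, L are static."""
    constant = base - size * lower
    return lambda index: constant + size * index   # only S*I remains at runtime

# Hypothetical array: A: array (10 .. 19) of Item, 4-byte items, stored at address 1000.
fast = make_fast_accessor(base=1000, size=4, lower=10)
print(cell_address(1000, 4, 10, 12), fast(12))   # 1008 1008
```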

Two Dimension Array Implementation - Row and Column Major

Implementer can choose one of two ways of storing the cells:

Row major: Store first row, then second row, ... (ie a(0,0), a(0,1), a(0,2), a(1,0), ... a(2,2))

Column major: Store first column, then second column, ... (ie a(0,0), a(1,0), a(2,0), ...)

For best performance, access all elements sequentially

Thus, programmer should know how 2D arrays are stored.
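
The two storage orders can be compared with a small sketch. The function names are made up, and indices are assumed to start at 0:

```python
def row_major_offset(i, j, num_rows, num_cols):
    """Cells before a(i, j) when rows are stored one after another."""
    return i * num_cols + j

def column_major_offset(i, j, num_rows, num_cols):
    """Cells before a(i, j) when columns are stored one after another."""
    return j * num_rows + i

cells = [(i, j) for i in range(3) for j in range(3)]
row_order = sorted(cells, key=lambda c: row_major_offset(*c, 3, 3))
col_order = sorted(cells, key=lambda c: column_major_offset(*c, 3, 3))
print(row_order[:4])   # [(0, 0), (0, 1), (0, 2), (1, 0)]
print(col_order[:3])   # [(0, 0), (1, 0), (2, 0)]
```

A loop that varies j fastest walks memory sequentially under row major but jumps by num_rows cells under column major.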

Two Dimension Arrays: Accessing a Cell

Formula: eg A + (numRowsBeforeLastRow * rowSize + numElementsInLastRow) * elementSize

Is this row or column major?

If we assume lower bounds are one:

Formula for a(i, j): A + ( (i-1) * rowSize + (j-1) ) * elementSize

What about lower bounds that are equal 0?

Formula for a(i, j): A + ( i * rowSize + j) * elementSize

What might be known at compiletime? (Assume lower bounds are 1)

Formula for a(i, j): A - (rowSize + 1) * elementSize + [(i * rowSize + j) * elementSize]
What about lower bounds that are not equal 0 or 1?

Static vs dynamic has a large impact on performance

Static vs dynamic has even more impact with more dimensions

When a loop accesses all elements in storage order, a compiler will figure out that it can simply add elementSize to the previous address
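
A sketch of the row major formula with arbitrary lower bounds, next to the version with the constant part folded for lower bounds of 1. The function names and sample sizes are assumptions; rowSize is the number of elements per row.

```python
def address_2d(base, i, j, row_size, element_size, low_i=1, low_j=1):
    """Row major address of a(i, j) for any lower bounds."""
    return base + ((i - low_i) * row_size + (j - low_j)) * element_size

def address_2d_precomputed(base, i, j, row_size, element_size):
    """Lower bounds of 1: the constant part is folded once, as a compiler could do."""
    constant = base - (row_size + 1) * element_size
    return constant + (i * row_size + j) * element_size

# Hypothetical 3 x 4 array of 8-byte elements stored at address 2000.
print(address_2d(2000, 2, 3, 4, 8), address_2d_precomputed(2000, 2, 3, 4, 8))   # 2048 2048
```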

Array Implementation - Jagged

Array of pointers
Inefficient
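
A Python list of lists is one example of the array-of-pointers layout: the outer list holds references to separately allocated rows, so rows may differ in length. A short sketch:

```python
jagged = [[1], [2, 3], [4, 5, 6]]       # each row is its own object

print([len(row) for row in jagged])     # [1, 2, 3]
print(jagged[2][1])                     # two lookups: find row 2, then its cell 1 -> 5
```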

Array Categories:

Categorize based on subscript range bindings, storage binding, and where storage is allocated
Categories

Static: subscript range bindings and storage allocation are static (ie bound before execution begins and then fixed)

Fastest. But size can't change and storage can't be shared

Fixed Stack Dynamic: subscript ranges are static (must be fixed at compile time) but storage allocation is dynamic (ie on stack)

Slower than static. Performance penalty for allocation. Stack has automatic deallocation. Allows reuse of storage, but size fixed at compile time.

Stack Dynamic: subscript ranges and allocation are bound at runtime but unchanging - storage allocation is dynamic (ie on stack)

Slower than fixed stack. Performance penalty for allocation. Stack has automatic deallocation. Allows subscript ranges to be set at runtime. Allows reuse of memory.

Fixed Heap Dynamic: allocated on heap. Subscript ranges and storage are bound at runtime and don't change

Speed comparable to stack dynamic. Performance penalty for allocation. No automatic deallocation. Allows subscript ranges to be set at runtime. Allows reuse of memory.

Heap Dynamic: allocated on heap. Subscript range and storage binding are changeable

May have to pay allocation penalty many times.

Examples of Array Categories:

static: C and C++ static arrays; Ada - declared in a package
fixed stack dynamic: C, C++ nonstatic; Ada locals
stack dynamic: Ada unconstrained array parameters and declare blocks
fixed heap dynamic: C: malloc and free; C++: new and delete; Java, C#
heap dynamic: Java and C# ArrayList, Perl, Python, Javascript

SKIP FALL 2010: Array Shapes and Lifetimes

Categorize based on where and when shape and storage allocated

Shape: subscript range bindings

Shapes

global lifetime, static shape: allocated in static area
local lifetime, static shape: allocated on stack
local lifetime, shape bound when allocated: allocated on stack - amount of memory allocated not known until runtime (which complicates accessing other local variables, etc)
arbitrary lifetime, shape bound when allocated: array allocated on heap - eg Java arrays (which can't change shape once allocated)
arbitrary lifetime, shape bound dynamically: allocated on heap - changes in size may require reallocation and copying. eg: Perl, Java ArrayList, C++ vector

Dynamic shapes require a runtime descriptor.
Static shapes allow arrays with no descriptors: compiler builds shape information into instructions.

SKIP FALL 2010: Array Lifetime, Category

Lifetime

Static/Global
Stack/Local
Arbitrary

Categories

static array: subscript range bindings and storage allocation are static - fast but not flexible
fixed stack dynamic: subscript ranges are static but storage allocation is dynamic - allows reuse of storage
stack dynamic: subscript ranges and allocation are dynamic but unchanging - flexible
fixed heap dynamic: allocated on heap; subscript ranges and storage are dynamic, but don't change
heap dynamic: on heap and changeable

Examples of categories:

static array: C and C++ static arrays; Ada - declared in a package
fixed stack dynamic: C, C++ nonstatic; Ada locals
stack dynamic: Ada unconstrained array parameters ?
fixed heap dynamic: C: malloc and free; C++: new and delete; Java, C#
heap dynamic: Java, C# ArrayList, Perl, Javascript

Associative Arrays

Unordered collection of data elements, each of which is indexed by a unique key

Built in to scripting languages such as Python, Perl, Ruby
Available in non-scripting languages (eg Ada, Java, C++, C#) in libraries

AKA: Map, dictionary, table, hash

Grows dynamically as needed

Operations: add, access, find, all keys, all elements

Typically implemented with a hash table

History - from world of hardware: associative memory allowed using logical address to lookup physical address
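
A hash table implementation of the operations listed above might look like the sketch below. The class name HashTable, the chaining strategy, and the growth rule are assumptions for illustration; Python's built-in dict is implemented differently.

```python
class HashTable:
    """Associative array: buckets of [key, value] pairs selected by hash(key)."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]
        self.count = 0

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def add(self, key, value):
        bucket = self._bucket(key)
        for pair in bucket:
            if pair[0] == key:          # key already present: replace its value
                pair[1] = value
                return
        bucket.append([key, value])
        self.count += 1
        if self.count > 2 * len(self.buckets):
            self._grow()                # grows dynamically as needed

    def _grow(self):
        old = [pair for bucket in self.buckets for pair in bucket]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        for key, value in old:
            self._bucket(key).append([key, value])

    def find(self, key):
        return any(k == key for k, _ in self._bucket(key))

    def access(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

    def keys(self):
        return [k for bucket in self.buckets for k, _ in bucket]

ages = HashTable()
ages.add("ann", 31)
ages.add("bob", 27)
ages.add("ann", 32)                           # same key: value replaced
print(ages.access("ann"), ages.find("cy"))    # 32 False
```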

Records

Aggregation of heterogeneous data elements, each of which is accessed by a (compile time) name

Introduced in Cobol, programmers previously used multiple arrays

C, C++, C# all use structs for records

Design issue: how to do self referential declarations

Implementation:

Stored in adjacent memory locations.
A given field is always at the same offset from the beginning
Compiler generates accesses using offsets from the start of the record
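
Fixed field offsets can be sketched with Python's standard struct module. The record type (id, score, grade), the helper names, and the choice to pack fields with no alignment padding are assumptions; real compilers usually insert padding.

```python
import struct

# Hypothetical record type: id (4-byte int), score (8-byte float), grade (1-byte char).
FIELDS = [("id", "i"), ("score", "d"), ("grade", "c")]

def layout(fields):
    """Assign each field a fixed offset, placing fields back to back."""
    offsets, position = {}, 0
    for name, code in fields:
        offsets[name] = (position, code)
        position += struct.calcsize("<" + code)
    return offsets, position

OFFSETS, RECORD_SIZE = layout(FIELDS)

def set_field(record, name, value):
    offset, code = OFFSETS[name]
    struct.pack_into("<" + code, record, offset, value)

def get_field(record, name):
    offset, code = OFFSETS[name]
    return struct.unpack_from("<" + code, record, offset)[0]

record = bytearray(RECORD_SIZE)        # adjacent memory locations
set_field(record, "id", 42)
set_field(record, "score", 3.5)
set_field(record, "grade", b"A")
print(OFFSETS["score"][0], RECORD_SIZE, get_field(record, "score"))   # 4 13 3.5
```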

Possible Operations: allocation, assignment (a:=b, and aggregate), equality

Relation to classes and objects:

class is record (and methods that operate on that type) accessed via a pointer.
Fields in record correspond to fields (ie instance and class variables) in classes
Ada uses records to implement inheritance by creating a new record type that extends an existing record type by adding fields to it

Classes and structs are related (virtually identical) in C++ and C#

Union

A type is a set of values and operations - a union type is a union of two other types (eg union of integer and float).

At any point in its lifetime, a variable of union type can only hold a value of one of the types in the union

Example - Ada Variant Records:

      type Shape is (Circle, Triangle, Rectangle);
      type Colors is ...

      type Figure(Form: Shape) is record
         
         Color: Colors;

         case Form is
            when Circle =>
               diameter: float;

            when Triangle =>
               left: integer;
               right: integer;
               angle: float;

            when Rectangle =>
               height: integer;
               width: integer;
         end case;
      end record;

      ...

      F1: Figure;     -- Compiles??
      put(F1.Color);  -- ??
      put(F1.Form);   -- ??

      F2: Figure(Form => Circle);

      F1 := (  Color => blue,
               Form => rectangle,
               Height => 11,
               Width => 22);

      F2 := F1;

C and C++ use unions

Design Issues:

Is there a flag that indicates which type is being stored (allowing RT type checking), or do you assume the programmer will only use the correct operation? (ie free or discriminated union)

No tag in C and C++

If there is a flag, can it be changed independently from the data? (Ada: can only change tag if also assign data)

Integrated with records?

More information on Variant Records: Ada's typesafe version of Union
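
The free versus discriminated distinction can be sketched as below. The function free_union_int_as_float and the class TaggedUnion are made up for illustration; the free case imitates a C union by reinterpreting the same 4 bytes.

```python
import struct

def free_union_int_as_float(n):
    """Free union: store a 32-bit int, then read the same bytes as a float. Nothing stops this."""
    return struct.unpack("<f", struct.pack("<i", n))[0]

class TaggedUnion:
    """Discriminated union of int and float: the tag records which type is stored."""

    def __init__(self, tag, value):
        self.set(tag, value)

    def set(self, tag, value):
        # Tag and data are only ever changed together.
        if tag not in ("int", "float"):
            raise ValueError(tag)
        self._tag, self._value = tag, value

    def get(self, tag):
        # Runtime type check against the stored tag.
        if tag != self._tag:
            raise TypeError(f"holds {self._tag}, not {tag}")
        return self._value

print(free_union_int_as_float(1065353216))   # 1.0, the int's bits read as a float
u = TaggedUnion("int", 7)
print(u.get("int"))                          # 7; u.get("float") would raise TypeError
```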

Pointer Types

Pointers have memory addresses (or nil) as values

Operations

Assign a variable's address to a pointer variable

p := X'Access; -- Ada
p = &X; // C

Dereference (ie access value pointed to)

*p in C
p.all in Ada

Design Issues:

Type checking
Can you point to elements on the stack?
Can you do arithmetic

C: p = p + 100
Others: no

Uses

Flexibility: C out parameters
Dynamic storage management

Use dynamically allocated memory
x := new pair()
x = malloc(100)

Reference Types

Implicit pointers

Java

Box b = new Box()
b.l = 3

C++'s approach to implementing in/out parameters

C++ Example:

#include <iostream>

void neg(int x, int &y)   // y is a reference: an in/out parameter
{
    y = -x;
}

int main(){
    int i = 3, j;

    neg(i, j);
    std::cout << j ; // -3
}

No pointer arithmetic
Automatic dereference
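
Python variables that name objects behave like reference types, so the Java Box example can be imitated in a short sketch. The class Box and its field l follow the example above; everything else is assumed.

```python
class Box:
    def __init__(self):
        self.l = 0

a = Box()
b = a          # copies the reference, not the object
b.l = 3        # automatic dereference: no * or .all needed
print(a.l)     # 3, because a and b refer to the same Box
print(a is b)  # True
```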

Heap Management

Two approaches: Single size and variable size cells

Focus on single size cells

Allocation and deallocation - Free list: simply keep a linked list of available cells

Allocation is simple: remove cells when needed
Deallocation: simply add cells back to free list? Problem is when to add back
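
A free list over single-size cells can be sketched as follows. The class name Heap and the use of list indices in place of addresses are assumptions for illustration.

```python
class Heap:
    """Heap of single-size cells; free cells are chained through a 'next' index."""

    def __init__(self, num_cells):
        # Initially every cell is free: cell i links to cell i + 1, and -1 ends the list.
        self.next = [i + 1 for i in range(num_cells)]
        self.next[-1] = -1
        self.free_head = 0

    def allocate(self):
        """Remove the first cell from the free list and return its index."""
        if self.free_head == -1:
            raise MemoryError("no free cells")
        cell = self.free_head
        self.free_head = self.next[cell]
        return cell

    def deallocate(self, cell):
        """Push the cell back on the front of the free list."""
        self.next[cell] = self.free_head
        self.free_head = cell

heap = Heap(3)
x, y, z = heap.allocate(), heap.allocate(), heap.allocate()
heap.deallocate(y)
print(x, y, z, heap.allocate())   # 0 1 2 1
```

The sketch does not decide when deallocate should be called, which is the hard part.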

Problems and solutions:

Dangling references - solutions: tombstone, lock and key
Garbage - Solutions: reference count and garbage collection

Reference count: Count number of references to a cell and deallocate the cell when the count reaches 0
Garbage collection:

Algorithms: mark and sweep, copy collection
Problems: performance
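
Reference counting can be sketched as below. The Cell class and the add_ref and drop_ref helpers are hypothetical; the freed list only records which cells were reclaimed.

```python
class Cell:
    def __init__(self, name, freed):
        self.name = name
        self.count = 0          # number of references to this cell
        self.children = []      # cells this cell points to
        self.freed = freed      # shared log of reclaimed cells (illustration only)

def add_ref(cell):
    cell.count += 1

def drop_ref(cell):
    cell.count -= 1
    if cell.count == 0:
        cell.freed.append(cell.name)     # reclaim the cell
        for child in cell.children:      # its outgoing references disappear too
            drop_ref(child)

freed = []
a, b = Cell("a", freed), Cell("b", freed)
add_ref(a)                          # a program variable points to a
a.children.append(b); add_ref(b)    # a points to b
drop_ref(a)                         # the variable goes away
print(freed)                        # ['a', 'b']

# A cycle defeats the scheme: x and y keep each other's count above zero.
x, y = Cell("x", freed), Cell("y", freed)
add_ref(x)
x.children.append(y); add_ref(y)
y.children.append(x); add_ref(x)
drop_ref(x)
print(x.count, y.count)             # 1 1, so neither cell is reclaimed
```

Cycles like this are one reason garbage collectors such as mark and sweep are used instead of, or alongside, reference counts.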

Java:

Early: mark and sweep; later incremental and generational
Separate algorithms for different age objects
Can choose algorithms (eg real time)

Arrow Types: Functions

Types for functions:
Example in Ada:

    function Square(i: integer) return integer; 

    type FunctionPointer is access function (i: integer) return integer;

    v : FunctionPointer := Square'access;

    j: integer := v.all(3);
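
For comparison, the same idea in Python with type hints from the standard typing module. The names mirror the Ada example; the hints are checked by external tools, not by the interpreter.

```python
from typing import Callable

def square(i: int) -> int:
    return i * i

# The arrow type int -> int, written as a Callable type.
FunctionPointer = Callable[[int], int]

v: FunctionPointer = square   # a function is a value that can be assigned
j: int = v(3)
print(j)                      # 9
```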

Mathematical Definitions

Array and Arrow: Mapping
Record: Cross product
Union: Union

Evolution of Features

Integer and Float
Fixed
Boolean:

Algol 60 (introduced type Boolean)
C uses numeric values

Character
Strings
Enumerated type (introduced by Pascal)
Subrange type (Pascal)
Arrays (Fortran I)
Associative Arrays (?)
Records (Cobol)
Union (Free: Fortran; Discriminated: Algol 68)
Pointer: ?