Chapter 6: Data Types
Overview
- Importance of types
- Primitive types
- Definitions
- Numeric
- Non-numeric
- Strings
- Primitive?
- Static?
- Operations
- Implementation
- User defined ordinal
- Arrays
- Definition
- Arrays and bounds checking
- Categories
- Initialization and Operations
- Implementation
- Associative arrays
- Records
- Definition
- Accessing fields
- Implementation
- Union Types
- Definition
- Free vs tagged
- Implementation
- Pointer Types
- Definition
- Operations
- Problems
- Languages: Ada, C, C++
- Reference types
- Pointer Implementation
- Evolution
Types are Important
- Types allow thinking at problem level rather than machine level
- Algorithms + Data Structures = Programs (a book by N. Wirth)
- Programmer needs
- Primitive types (eg integer)
- User defined types
- To be able to implement ADTs (Ch. 11)
- Way of organizing data: Modules and classes (Chapter 11)
- A programmer should understand the language's type system!
Chapter 6 Goals
- We will study the following:
- The types provided by different languages
- Design issues involved with different types
- Choices made by common languages
- How types are implemented
Primitive Types
- We study primitive and non-primitive types
- The term primitive is used with several meanings:
Primitive | Non-Primitive
Book's def: Not defined in terms of other types | Defined in terms of other types
Provided by language | User defined
Scalar: has no hierarchical structure | Structured or Composite: has hierarchical structure
Variable holds value | Reference type: variable holds a pointer to a value (eg Java)
Numeric Types
- Integer
- Signed: 2's complement most common
- Unsigned integers allowed?
- Size part of language definition
- Floating Point:
- Floating point implementation is hard
- Problems include how to do rounding and truncation
- Extra digits needed in calculations when aligning exponents
- IEEE 754 Floating point standard resolved many issues (but issues remain)
- Standard specifies how to handle overflow, underflow, +/- infinity, NaN, rounding
- Single precision: 1 bit sign, 8 bit exponent, 23 bit significand
- Largest number for single precision: about 3.4 * 10^38, double precision: about 1.8 * 10^308
- Smallest normalized positive number for single precision: about 1.2 * 10^-38, double precision: about 2.2 * 10^-308
- Exponent is in biased (not 2's complement) notation to simplify sorting (goal: less than comparison for integers also works for floating point numbers)
- Significand is normalized and the first one bit is assumed
- Example: 3/8 = 1/4 + 1/8 = 0.01 + 0.001 = 0.0110 (binary) = .1100 * 2^-1, which is written as .1000 * 2^-1 once the assumed leading 1 is dropped (IEEE 754 itself normalizes to 1.1 * 2^-2; the stored fraction bits are the same)
- Problem: how to represent 0.000? Answer: special value of 32 zeros
- Size of exponent and significand is tradeoff between accuracy
and range
- Fixed point: number of decimal (or binary) places of precision fixed
- Good for banking
- Cobol, Ada, PL/1
- Implement as
- integer with divisor: less space and faster ops
- Ada allows both decimal or binary fixed point
- Rational:
- Separate numerator and denominator
- Exact arithmetic: (1.0 / 3.0) * 3.0 = 1.0 ??
- LISP ratios
- Ada compile time constants: pi: constant := 3.14... to 50 places
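The single-precision layout and the exact-arithmetic question above can be checked with a few lines of Python. This is only a sketch using the standard `struct` and `fractions` modules; the helper name `decode_float32` is hypothetical and not part of these notes.

```python
import struct
from fractions import Fraction

def decode_float32(x):
    """Split x (rounded to IEEE 754 single precision) into its three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF        # 23 bits, the leading 1 is not stored
    return sign, exponent, fraction

# 3/8 = 1.1 (binary) * 2**-2, so the unbiased exponent is -2
# and the stored fraction starts with a single 1 bit.
sign, exponent, fraction = decode_float32(0.375)
print(sign, exponent - 127, format(fraction, "023b"))

# Zero is the special all-zero bit pattern.
print(decode_float32(0.0))

# Binary floating point cannot represent 1/10 exactly; a rational type can.
print(0.1 + 0.2 == 0.3)                                      # False
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True
print(Fraction(1, 3) * 3 == 1)                               # True
```

The rational version pays for its exactness by storing a separate numerator and denominator, which is the tradeoff listed under Rational above.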
Non-Numeric Primitive: Boolean
- Issue: Separate type or designate values as true/false
- Java, Ada, C# have a separate type: gives better type checking
- C, C++, Perl, Python, ... designate numbers, strings, pointers, ... as true/false
- Expressive, but type checking catches fewer errors
- Not normally stored as a bit - Why?
- Issue: Built-in type or standard type
- Introduced by Algol 60
Non-Numeric Primitive: Characters
- Three common sizes:
- ASCII (7 bit) or ISO 8859-1 (8 bit)
- Unicode (16 bit)
- First 128 characters are identical to ASCII
- Java character
- Ada Wide_Character
- Python, Perl, C# and Javascript use Unicode
- Wide Unicode (32 bit)
- Issue for 16 and 32 bit: big-endian or little endian (ie are bytes ordered
high to low or low to high addresses)
- Usually a primitive type (ie Built-in type)
- Python: String of length 1
- Ada: Enumerated type in package Standard
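The byte-order issue for 16 bit characters can be seen by encoding the same text both ways. This is a small sketch using Python's built-in codecs; the sample text is arbitrary.

```python
text = "A\u00e9"   # 'A' (U+0041) and 'e' with acute accent (U+00E9)

big = text.encode("utf-16-be")      # high-order byte of each unit first
little = text.encode("utf-16-le")   # low-order byte of each unit first

print(list(big))      # [0, 65, 0, 233]
print(list(little))   # [65, 0, 233, 0]

# In Python a character is simply a string of length 1.
print(len(text[0]), type(text[0]) is str)
```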
Strings
- Issue: Is a string
- an array of characters (C, C++, Ada)
- or a special type (Java, C#, Javascript, PPR (Perl, Python, Ruby))
- Issue: String Length
- Static (ie Fixed): Ada, Java, Python, C# .NET, C++ Standard library
- May be immutable, as in Java
- Limited Dynamic (ie Bounded): Ada, C, C++
- Dynamic (ie Variable) : Ada, C++ SL, Javascript, Perl
- Implementation
- Static:
- Header record or object containing length and address or array
- Limited Dynamic:
- Array of characters with end marker (C: 0 byte)
- Header record with maximum length, current length, and address or array
- Dynamic:
- Linked list
- Dynamic allocation:
- Reallocate when size changes
- Amortized (ie header with current length, allocate block, reallocate when block filled)
- Evaluation:
- Static: faster, but not flexible
- Dynamic: slower, but flexible
- Limited Dynamic: compromise speed and flexibility
- Security: 0-terminated is dangerous
- Reference or value semantics
- Where allocated
- Static: C string literals
- Stack: Ada fixed and bounded length
- Heap: Java strings, variable length strings
- Operations
- Allocation
- Declaration: implicit or explicit:
- Ada: foo: String(1..10); foo2: String := "Hi Mom";
- Java: String s = "abc"; String s2 = new String("ABC");
- C: char *s = "abc"; /* literal occupies 4 bytes, including the 0 terminator */
- C: char *s = malloc(10); /* allocates room for a 9 char string plus terminator */
- Index, slice/substring
- Substitution/changing length (Java Strings are immutable)
- Pattern matching
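The "header record with maximum length, current length, and array" implementation of a limited dynamic string can be sketched as follows. The class name `BoundedString` and its methods are hypothetical, chosen only for illustration; a real implementation would use a raw character array rather than a Python list.

```python
class BoundedString:
    """Limited dynamic string: capacity is fixed, current length varies."""

    def __init__(self, max_length, initial=""):
        self.max_length = max_length       # fixed when the string is created
        self.buffer = [""] * max_length    # storage is allocated exactly once
        self.length = 0                    # current length, stored in the header
        self.append(initial)

    def append(self, text):
        # The length check replaces the unchecked writes that make
        # 0-terminated strings dangerous.
        if self.length + len(text) > self.max_length:
            raise ValueError("string would exceed its maximum length")
        for ch in text:
            self.buffer[self.length] = ch
            self.length += 1

    def value(self):
        return "".join(self.buffer[:self.length])


s = BoundedString(8, "Hi")
s.append(" Mom")
print(s.value(), s.length)   # Hi Mom 6
```

Because the length lives in the header, finding the length is a constant-time lookup instead of a scan for an end marker.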
User-Defined Ordinal Types
- Ordinal Type: values can be ordered (ie has operations first, last, next [at least implicitly])
- Enumerated types
- Subrange types
- Enumerated types:
- Define a new type with a new set of values represented by
identifiers
- Ada, C#, introduced by Pascal, added to Java 1.5
- Operations
- declare
- assignment
- relational
- array indices
- loop index over subrange
- Issue: Strongly typed? (ie no arithmetic such as blue + 1)
- Ada, Java, C#: Strongly typed
- C: int i = blue; mycolor = blue + 1; # all okay
- C++: int i = blue; # okay
- C++: mycolor = 111; # not okay
- C++: mycolor = (color) 111; # okay
- Enhance readability and type checking
- Implemented as integers
- Contrast with Java (pre 1.5): public static final int RED = 3;
- Trends: returning to favor, but scripting languages don't use them
- Subrange types
- Subrange of an ordinal type
- Subtype compatible with parent type
- Compile time and runtime type checking
- Improves reliability and readability
- Introduced by Pascal
- Ada is the only common language with subrange types
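The strongly typed versus integer-like distinction for enumerated types can be illustrated with Python's standard `enum` module, even though the notes observe that scripting languages rarely rely on enumerations. The names `Color`, `CColor`, and `check_subrange` are hypothetical; the last one only imitates the runtime half of a subrange check.

```python
from enum import Enum, IntEnum

class Color(Enum):        # strongly typed: no arithmetic on values
    RED = 1
    GREEN = 2
    BLUE = 3

class CColor(IntEnum):    # behaves like a C enum: values are integers
    RED = 1
    GREEN = 2
    BLUE = 3

try:
    Color.BLUE + 1
except TypeError as error:
    print("rejected:", error)

print(CColor.BLUE + 1)    # 4, a plain int that names no color

def check_subrange(value, low, high):
    """Runtime check in the spirit of an Ada subrange such as 1 .. 10."""
    if not low <= value <= high:
        raise ValueError(f"{value} is outside {low} .. {high}")
    return value

print(check_subrange(5, 1, 10))
```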
Array Definition
- Mapping from ordinal type (normally integer) to some other type
- Homogeneous aggregate (ie all elements are of same type)
- Stored contiguously
Array Subscripts
- Issue: What can be a subscript?
- Any ordinal type?
- Integers only?
- Start at 0 or any range?
- How are subscripts denoted:
- () - not distinguishable from a function (conceptually the same as a function)
- [] - distinguishable from a function
- Are subscripts checked at runtime:
- Yes: Ada, Java, C#, Python (Benefit?)
- No: C, C++, Perl, Fortran (What happens if out of bounds? Benefit?)
- perl checks if -w is used, otherwise not
- Is a check needed in all situations?
Array Initialization
- C, C++, C#, Java:
- int list [] = {1, 2, 3};
- compiler cannot check size
- allows: int list [] = {11,12, 11, }; // Size?
- Ada:
- list: array (0..2) of integer := (11, 22, 33);
- checks size - redundant information
- list: array (0..2) of integer := (1 => 12, others => 11);
- aggregate value, keyword or positional values, others
Array Operations
- Assignment, equality
- reference vs value semantics
- Aggregate value
- Mathematical operations: APL and Fortran
- Slicing: Ada, Fortran
Multidimensional Array
- Rectangular: Ada, C#, Fortran - store contiguous rows (or columns, see below)
- Example: a: array (1..3, 1..3) of integer; a(i, j) := 0;
- Array of arrays: eg Ada, Java, C, C++, C# - store column of pointers
- int[][] a = new int[3][]; a[0] = new int[3]; a[1] = new int[3];
- access with a[i][j]
- Some languages use z(i, j) as syntactic sugar for z(i)(j)
- allows jagged arrays
Array Implementation - Accessing a Cell
- Assume group of adjacent memory cells
- Calculation of cell address:
- Assume A: array (L .. U) of Item
- Assume Item'size = S (in bytes)
- Assume A'address = A
- Address of A[I] = A + S * (I - L) = A + S*I - S*L = (A - S*L) + S*I
- Runtime vs compile time calculations
- If A, S, and L are static, then compute at compile time: A - S*L
- If all are dynamic, compute all at runtime
- Array references are typically highly optimized
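The address calculation above, including the part that can be folded into a constant, can be written out as a short sketch. The function names and the sample numbers are hypothetical.

```python
def element_address(base, elem_size, lower, index):
    """Address of A[index] computed directly: A + S * (I - L)."""
    return base + elem_size * (index - lower)

def make_accessor(base, elem_size, lower):
    """Precompute (A - S*L) once, as a compiler can when A, S, and L are static."""
    virtual_origin = base - elem_size * lower
    return lambda index: virtual_origin + elem_size * index

# Assumed example: A at address 1000, 4-byte items, bounds 5 .. 9.
address_of = make_accessor(1000, 4, 5)
print(element_address(1000, 4, 5, 7))   # 1008
print(address_of(7))                    # 1008
```

Each access through the accessor costs one multiply and one add, which is why static bounds help performance.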
Two Dimension Array Implementation - Row and Column Major
- Implementer can choose one of two ways of storing the cells:
- Row major: Store first row, then second row, ... (ie a(0,0), a(0,1),
a(0,2), a(1,0), ... a(2,2))
- Column major: Store first column, then second column, ... (ie a(0,0),
a(1,0), a(2,0), ...)
- For best performance, access all elements sequentially
- Thus, programmer should know how 2D arrays are stored.
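The two storage orders can be compared by flattening a small matrix both ways. This is only an illustration; the helper names are hypothetical and the matrix is modelled as a Python list of rows.

```python
def row_major(matrix):
    """Store the first row, then the second row, and so on."""
    return [x for row in matrix for x in row]

def column_major(matrix):
    """Store the first column, then the second column, and so on."""
    return [row[j] for j in range(len(matrix[0])) for row in matrix]

a = [[1, 2, 3],
     [4, 5, 6]]
print(row_major(a))      # [1, 2, 3, 4, 5, 6]
print(column_major(a))   # [1, 4, 2, 5, 3, 6]
```

A loop that visits elements in the same order as they are stored touches memory sequentially, which is the access pattern the notes recommend.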
Two Dimension Arrays: Accessing a Cell
- Formula: eg A + (numRowsBeforeLastRow * rowSize + numElementsInLastRow) * elementSize
- Is this row or column major?
- If we assume lower bounds are one:
- Formula for a(i, j): A + ( (i-1) * rowSize + (j-1) ) * elementSize
- What about lower bounds that are equal 0?
- Formula for a(i, j): A + ( i * rowSize + j) * elementSize
- What might be known at compiletime? (Assume lower bounds are 1)
- Formula for a(i, j): A - (rowSize + 1) * elementSize + [(i * rowSize + j) * elementSize], where the first two terms can be combined at compile time
- What about lower bounds that are not equal 0 or 1?
- Static vs dynamic has a large impact on performance
- Static vs dynamic has even more impact with more dimensions
- To access all elements in storage order, a compiler can simply add elementSize to the previous address instead of recomputing the formula
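The row-major formulas can be checked against each other with a small sketch that handles arbitrary lower bounds. The function names and sample values are hypothetical; `row_size` is the number of elements in one row.

```python
def address_2d(base, i, j, row_size, elem_size, low_i=1, low_j=1):
    """Row-major address of a(i, j), computed entirely at run time."""
    return base + ((i - low_i) * row_size + (j - low_j)) * elem_size

def make_accessor_2d(base, row_size, elem_size, low_i=1, low_j=1):
    """Fold everything that does not depend on i and j into one constant."""
    constant = base - (low_i * row_size + low_j) * elem_size
    def address(i, j):
        return constant + (i * row_size + j) * elem_size
    return address

# Assumed example: A at 1000, 3 elements per row, 4-byte elements, lower bounds 1.
address_of = make_accessor_2d(1000, 3, 4)
print(address_2d(1000, 2, 3, 3, 4))   # 1020
print(address_of(2, 3))               # 1020

# Walking a row in storage order just adds the element size each step.
print([address_of(1, j) for j in (1, 2, 3)])   # [1000, 1004, 1008]
```

With lower bounds of 1 the folded constant is A - (rowSize + 1) * elementSize, and with lower bounds of 0 it is simply A.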
Array Implementation - Jagged
- Array of pointers
- Inefficient
Array Categories:
- Categorize based on subscript range bindings, storage binding, and where storage is allocated
- Categories
- Static: subscript range bindings and storage allocation are static (ie fixed before execution begins)
- Fastest. But size can't change and storage can't be shared
- Fixed Stack Dynamic: subscript ranges are static (must be fixed at compile time)
but storage allocation is dynamic (ie on stack)
- Slower than static. Performance penalty for allocation.
Stack has automatic deallocation.
Allows reuse of storage, but size fixed at compile time.
- Stack Dynamic: subscript ranges and allocation are bound at runtime but
unchanging - storage allocation is dynamic (ie on stack)
- Slower than fixed stack. Performance penalty for allocation.
Stack has automatic deallocation.
Allows subscript ranges to be set at runtime. Allows reuse of memory.
- Fixed Heap Dynamic: allocated on heap. Subscript ranges and storage are
bound at runtime and don't change
- Speed comparable to stack dynamic. Performance penalty for allocation.
No automatic deallocation.
Allows subscript ranges to be set at runtime. Allows reuse of memory.
- Heap Dynamic: allocated on heap. Subscript range and storage binding
are changeable
- May have to pay allocation penalty many times.
Examples of Array Categories:
- static: C and C++ static arrays; Ada - declared in a package
- fixed stack dynamic: C, C++ nonstatic; Ada locals
- stack dynamic: Ada unconstrained array parameters and declare blocks
- fixed heap dynamic: C: malloc and free; C++: new and delete;
Java, C#
- heap dynamic: Java and C# ArrayList, Perl, Python, Javascript
SKIP FALL 2010: Array Shapes and Lifetimes
- Categorize based on where and when shape and storage allocated
- Shape: subscript range bindings
- Shapes
- global lifetime, static shape: allocated in static area
- local lifetime, static shape: allocated on stack
- local lifetime, shape bound when allocated: allocated on stack
- amount of memory allocated not known until runtime (which complicates
accessing other local variables, etc)
- arbitrary lifetime, shape bound when allocated: array allocated
on heap - eg Java arrays (which can't change shape once allocated)
- arbitrary lifetime, shape bound dynamically: allocated on heap
- changes in size may require reallocation and copying. eg: Perl, Java
ArrayList, C++ vector
- Dynamic shapes require a runtime descriptor.
- Static shapes allow arrays with no descriptors: compiler builds
shape information into instructions.
SKIP FALL 2010: Array Lifetime, Category
- Lifetime
- Static/Global
- Stack/Local
- Arbitrary
- Categories
- static array: subscript range bindings and storage allocation
are static - fast but not flexible
- fixed stack dynamic: subscript ranges are static but storage
allocation is dynamic - allows reuse of storage
- stack dynamic: subscript ranges and allocation are dynamic but
unchanging - flexible
- fixed heap dynamic: allocated on heap ranges and storage are
dynamic, but don't change
- heap dynamic: on heap and changeable
- Examples of categories:
- static array: C and C++ static arrays; Ada - declared in a
package
- fixed stack dynamic: C, C++ nonstatic; Ada locals
- stack dynamic: Ada unconstrained array parameters ?
- fixed heap dynamic: C: malloc and free; C++: new and delete;
Java, C#
- heap dynamic: Java, C# ArrayList, Perl, javascript
Associative Arrays
- Unordered collection of data elements, each of which is indexed
by a unique key
- Built in to scripting languages such as Python, Perl, Ruby
- Available in non-scripting languages (eg Ada, Java, C++, C#) in libraries
- AKA: Map, dictionary, table, hash
- Grows dynamically as needed
- Operations: add, access, find, all keys, all elements
- Typically implemented with a hash table
- History - from world of hardware: associative memory allowed using logical address to
lookup physical address
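A hash table implementation of an associative array can be sketched in a few lines. This is a minimal illustration using separate chaining and a simple doubling rule; the class name `HashMap` and the growth threshold are assumptions, not a description of how any particular language implements its built-in type.

```python
class HashMap:
    """Associative array: hash table with one list (chain) per bucket."""

    def __init__(self, bucket_count=8):
        self.buckets = [[] for _ in range(bucket_count)]
        self.count = 0

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing, _) in enumerate(bucket):
            if existing == key:              # key already present: replace value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1
        if self.count > 2 * len(self.buckets):
            self._grow()                     # grows dynamically as needed

    def get(self, key):
        for existing, value in self._bucket(key):
            if existing == key:
                return value
        raise KeyError(key)

    def keys(self):
        return [key for bucket in self.buckets for key, _ in bucket]

    def _grow(self):
        pairs = [pair for bucket in self.buckets for pair in bucket]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        for key, value in pairs:
            self._bucket(key).append((key, value))


ages = HashMap()
ages.put("ada", 36)
ages.put("grace", 85)
ages.put("ada", 37)          # replaces the earlier value
print(ages.get("ada"), sorted(ages.keys()))
```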
Records
- Aggregation of heterogeneous data elements, each of which is
accessed by a (compile time) name
- Introduced in Cobol, programmers previously used multiple arrays
- C, C++, C# all use structs for records
- Design issue: how to do self referential declarations
- Implementation:
- Stored in adjacent memory locations.
- A given field is always at the same offset from the beginning
- Compiler generates accesses using offsets from the start of the record
- Possible Operations: allocation, assignment (a:=b, and
aggregate), equality
- Relation to classes and objects:
- class is record (and methods that operate on that type)
accessed via a pointer.
- Fields in record correspond to fields (ie instance and class
variables) in classes
- Ada uses records to implement inheritance by creating a new
record type that extends an existing record type by adding fields to it
- Classes and structs are related (virtually identical) in C++ and C#
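The "fixed offset from the beginning" implementation of records can be simulated with a byte buffer. This sketch assumes a hypothetical record with three fields and no alignment padding; real compilers often insert padding between fields, so the offsets here are illustrative only.

```python
import struct

# Hypothetical record layout: (field name, struct format code, size in bytes).
FIELDS = [("id", "i", 4), ("age", "h", 2), ("score", "d", 8)]

def compute_offsets(fields):
    """Assign each field the next free offset (assumption: no padding)."""
    offsets, position = {}, 0
    for name, code, size in fields:
        offsets[name] = (position, code)
        position += size
    return offsets, position

OFFSETS, RECORD_SIZE = compute_offsets(FIELDS)

def set_field(record, name, value):
    offset, code = OFFSETS[name]
    struct.pack_into("<" + code, record, offset, value)

def get_field(record, name):
    offset, code = OFFSETS[name]
    return struct.unpack_from("<" + code, record, offset)[0]

record = bytearray(RECORD_SIZE)      # adjacent memory locations
set_field(record, "id", 42)
set_field(record, "age", 30)
set_field(record, "score", 3.5)
print(OFFSETS, RECORD_SIZE)
print(get_field(record, "id"), get_field(record, "age"), get_field(record, "score"))
```

The offset table is computed once, playing the role of the compiler; every later access is just base plus a known constant.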
Union
- A type is a set of values and operations - a union type is a union of two other types
(eg union of integer and float).
- At any point in its lifetime, a variable of union type can only hold a value of
one of the types in the union
- Example - Ada Variant Records:
type Shape is (Circle, Triangle, Rectangle);
type Colors is ...
type Figure(Form: Shape) is record
    Color: Colors;
    case Form is
        when Circle =>
            diameter: float;
        when Triangle =>
            left: integer;
            right: integer;
            angle: float;
        when Rectangle =>
            height: integer;
            width: integer;
    end case;
end record;
...
F1: Figure; -- Compiles??
put(F1.Color); -- ??
put(F1.Form); -- ??
F2: Figure(Form => Circle);
F1 := ( Color => blue,
        Form => rectangle,
        Height => 11,
        Width => 22);
F2 := F1;
C and C++ use unions
Design Issues:
- Is there a flag that indicates which type is being
stored (allowing RT type checking), or do you assume the programmer
will only use the correct operation. (ie free or discriminated union)
- If there is a flag, can it be changed independently
from the data? (Ada: can only change tag if also assign data)
- Integrated with records?
More information on Variant Records:
Ada's typesafe version of Union
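The difference between a free union and a discriminated union can be sketched as follows. The free union is imitated by reinterpreting the same four bytes with the standard `struct` module; the `Figure` class is a hypothetical, simplified stand-in for the Ada variant record, with the tag checked at run time.

```python
import struct

# Free union: one 4-byte cell, no tag. Nothing stops us from storing a
# float and then reading the same bits back as an integer.
cell = bytearray(4)
struct.pack_into("<f", cell, 0, 1.0)
as_int = struct.unpack_from("<i", cell, 0)[0]
print(hex(as_int))   # 0x3f800000, the bit pattern of 1.0, not the number 1

class Figure:
    """Discriminated union: the tag (form) decides which fields exist."""
    VARIANTS = {
        "circle": {"diameter"},
        "triangle": {"left", "right", "angle"},
        "rectangle": {"height", "width"},
    }

    def __init__(self, form, color, **fields):
        if set(fields) != self.VARIANTS[form]:
            raise TypeError(f"wrong fields for form {form!r}")
        self.form = form              # tag and data are always set together
        self.color = color
        self._fields = dict(fields)

    def get(self, name):
        if name not in self._fields:  # runtime type check using the tag
            raise AttributeError(f"{self.form} has no field {name!r}")
        return self._fields[name]

f1 = Figure("rectangle", "blue", height=11, width=22)
print(f1.get("height"))
```

Asking `f1` for `diameter` raises an error instead of silently reinterpreting the rectangle's data, which is the safety a tag buys.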
Pointer Types
- Pointers have memory addresses (or nil) as values
- Operations
- Assign a variable's address to a pointer variable
- p := X'Access; -- Ada
- p = &X // C
- Dereference (ie access value pointed to)
- Design Issues:
- Type checking
- Can you point to elements on the stack?
- Can you do arithmetic
- C: p = p + 100
- Others: no
- Uses
- Flexibility: C out parameters
- Dynamic storage management
- Use dynamically allocated memory
- x := new pair()
- x = malloc(100)
- Reference Types
- Implicit pointers
- Java
- Box b = new Box()
- b.l = 3
- C++'s approach to implementing in/out parameters
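Reference types behave like the Java `Box` example above: assignment copies the implicit pointer, not the object. Python variables work the same way, so the aliasing can be shown directly. The `Box` class here is a hypothetical stand-in with a single field `l`.

```python
class Box:
    def __init__(self, l=0):
        self.l = l

a = Box()
b = a            # copies the reference; a and b name the same object
b.l = 3
print(a.l)       # 3, the change is visible through a

c = Box(a.l)     # a separate object with the same field value
c.l = 7
print(a.l, a is b, a is c)   # 3 True False
```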
Heap Management
- Two approaches: Single size and variable size cells
- Focus on single size cells
- Allocation and deallocation - Free list: simply keep a linked
list of available cells
- Allocation is simple: remove cells when needed
- Deallocation: simply add cells back to free list? Problem is when to
add back
- Problems and solutions:
- Dangling references - solutions: tombstone, lock and key
- Garbage - Solutions: reference count and garbage collection
- Reference count: Count number of references to a cell and deallocate when
count reaches 0
- Garbage collection:
- Algorithms: mark and sweep, copy collection
- Problems: performance
- Java:
- Early: mark and sweep; later incremental and generational
- Separate algorithms for different age objects
- Can choose algorithms (eg real time)
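A free list of single-size cells and a mark and sweep pass can be sketched together. This is a toy model under stated assumptions: cells are slots in a Python list, a cell's contents are just the indices of the cells it references, and the class name `Heap` and its methods are hypothetical.

```python
class Heap:
    """Toy heap of single-size cells managed with a free list."""

    def __init__(self, size):
        self.cells = [None] * size           # None means the cell is free
        self.free_list = list(range(size))   # indices of available cells

    def allocate(self):
        if not self.free_list:
            raise MemoryError("heap exhausted")
        index = self.free_list.pop()         # allocation: take a cell off the list
        self.cells[index] = []               # outgoing references of the new cell
        return index

    def link(self, source, target):
        self.cells[source].append(target)

    def collect(self, roots):
        """Mark every cell reachable from roots, then sweep the rest."""
        marked = set()
        stack = list(roots)
        while stack:                         # mark phase
            index = stack.pop()
            if index not in marked:
                marked.add(index)
                stack.extend(self.cells[index])
        freed = 0
        for index, cell in enumerate(self.cells):   # sweep phase
            if cell is not None and index not in marked:
                self.cells[index] = None
                self.free_list.append(index)        # deallocation: back on the list
                freed += 1
        return freed


heap = Heap(4)
a, b, c = heap.allocate(), heap.allocate(), heap.allocate()
heap.link(a, b)    # b is reachable from a
heap.link(c, c)    # c refers only to itself: garbage once no root points to it
print(heap.collect(roots=[a]))   # 1
```

The self-referencing cell would never reach a count of 0 under plain reference counting, which is one reason tracing collectors are used.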
Arrow Types: Functions
- Types for functions:
- Example in Ada:
function Square(i: integer) return integer;
type FunctionPointer is access function(i: integer) return integer;
v : FunctionPointer := Square'access;
j: integer := v.all(3);
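For comparison, the same arrow type can be written with Python's standard `typing.Callable`. This sketch mirrors the Ada example; the alias `IntFunction` and the helper `apply_twice` are hypothetical names.

```python
from typing import Callable

def square(i: int) -> int:
    return i * i

# The arrow type int -> int, analogous to the Ada access-to-function type.
IntFunction = Callable[[int], int]

v: IntFunction = square
j: int = v(3)
print(j)   # 9

def apply_twice(f: IntFunction, x: int) -> int:
    """A function type lets functions be passed as ordinary values."""
    return f(f(x))

print(apply_twice(square, 3))   # 81
```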
Mathematical Definitions
- Array and Arrow: Mapping
- Record: Cross product
- Union: Union
Evolution of Features
- Integer and Float
- Fixed
- Boolean:
- Algol 60 (introduced type Boolean)
- C uses numeric values
- Character
- Strings
- Enumerated type (introduced by Pascal)
- Subtype (Pascal)
- Arrays (Fortran I)
- Associative Arrays (?)
- Records (Cobol)
- Union (Free: Fortran; Discriminated: Algol 68)
- Pointer: ?