ITEC 360: Program 3

Due by 11:59:59 p.m., Saturday, May 2, 2020

Late programs not accepted.

Example submit command: `submit itec360-01 README.txt editdis.adb anyotherfiles.ads`

DRAFT

Last modified:
Updates:

2020 May 01 05:41:50 PM: Corrected error in first calculation
2020 Apr 25 11:21:28 PM: Set due date for 2020

You can see the assignment checker for this program at itec-noki01/360.

Many search engines give you a list of words that are close to your search terms. This feature can be implemented by listing the words that are within a certain edit distance of the search terms. For this assignment, the edit distance from word X to word Y is defined to be the minimal cost sequence of changes that transforms X into Y. Possible changes, and example associated costs, include

copy - cost 0
deletion - cost 1
insertion - cost 1
substitution - cost 1

For example, You can transform "computer" into "science" as follows:

computer
scomputer  insertion
scomputer  copy
scimputer  substitution
scieputer  substitution
scienuter  substitution
sciencter  substitution
sciencer   deletion
sciencer   copy
science    deletion

Thus, the edit distance from computer to science is (2 * 0) + (2 * 1) + (1 * 1) + (4 * 1) = 7 n`&nbs;;&nbs;; ~~(1 * 0) + (2 * 1) + (1 * 1) + (4 * 1) = 7~~. With a different set of weights:

copy - cost 1
deletion - cost 3
insertion - cost 2
substitution - cost 4

the edit distance for this same pair of words is, somewhat surprisingly, defined by the same set of changes, with a total cost of (2 * 1) + (2 * 3) + (1 * 2) + (4 * 4) = 26. With the weights:

copy - cost 1
deletion - cost 3
insertion - cost 4
substitution - cost 2

the total cost is (1 * 1) + (1 * 3) + (0 * 4) + (6 * 2) = 16.

Search engines might select words with a low edit distance when making suggestions.

Assignment:
Your assignment is to use a dynamic programming algorithm to efficiently calculate the edit distance between pairs of words. A word is defined to be a maximal sequence of letters from upper and lower case 'a' to 'z'. You will use your algorithm to find and mark all of the words in a body of text that can be transformed into a given, specified word in less than or equal to a specified permissible edit distance. You do not have to search for substrings that are identical to the specified word. Your program should ignore the case of letters when calculating edit distance, but not when producing output.

Input and output:
The first line of the input will be only a single, left-justified word, known as the search word. The search word will consist of only letters from upper and lower case 'a' to 'z'.

The second line of the input will be an Natural, the maximum permissible edit distance.

The third line of the input will be four Natural numbers, which represent the weights for copy, deletion, insertion, substitution, respectively.

The remainder of the input will be a body of text containing one or more words that are separated by punctuation marks or white space (ie spaces, tabs, and end of line characters).

The output from the program is an EXACT copy of the input, with two exceptions. The first exception is that parentheses are added around each word within the body of text that can be transformed into the search word within the permissible edit distance. The second exception is the formatting of the distance and weights: The maximum distance and the operation weights should be printed without leading zeros and they should be left justified, with the weights separated by a single space.

Your program should read from standard input and write to standard output.

Is Edit Distance Commutative: Notice that edit distance calculations are not commutative. The distance from "computer" to "science" may be different from the distance from "science" to "computer". You may want to ponder the circumstances under which the calculation is commutative, although doing so is not necessary for the assignment. You will want to make sure that you are doing your calculation in the correct order.

Sample run:
Example standard input:

lovely
2
0 1 1 1
You have always been
my love my love, for the
love I love is lovely
as love itself.  Love? 
did you hang up?

How many "loves" in this text?

Expected output for example input:

lovely
2
0 1 1 1
You have always been
my (love) my (love), for the
(love) I (love) is (lovely)
as (love) itself.  (Love)? 
did you hang up?

How many "(loves)" in this text?

Readme: Please submit file README.txt that contains instructions to compile and run your program. Also describe how well your program works.

Documentation: You will, of course, want to follow good style in writing your program. Please include the following two items in the documentation of your main routine:

Instructions on compiling and running your program. Also include these instructions in your README file.
The recurrence equation for the solution that you are implementing, including initial conditions.
Information on any references that you use in coming up with your solution.

Hint 1: If you find it difficult to break the program into words and to copy the input to the output, then I highly suggest that you design your algorithm so that it works as a finite state machine. Your machine will have two main states: one for processing inside words and one for processing outside words. This pseudocode, which doesn't necessarily adequately consider eol and eof, gives the general idea:

type States is (outside_a_word, inside_a_word);
state: States := outside_a_word
loop until eof
   get(c);
   case state is
      when outside_a_word =>
         if c is not a letter
            put(c)
         else  -- c is a letter
            token := c
            state := inside_a_word
         end if
      when inside_a_word =>
         if c is a letter
            token := token + c
         else  -- c not a letter
            check word and print as is or with ()
            state := outside_a_word
         end if
   end case
end loop

In summary, you should carefully consider what your program does to process a character when your program is in either of these two states, and then code your algorithm so that it handles the character appropriately based based upon which of the two states it is in. This pseudocode implies that you process the input one character at a time; however, it is up to you whether your input is read one character at a time or whether the input is read one line at a time (and processed one character at a time).

Hint 2: First solve the problem with all costs as 1.

Submission and language: You should use a command similar to the one above to submit your program. Include instructions on how to compile and run your code in your main routine and in your README file. If you want to use any language other than Ada, C, C++, or Java, please discuss it with me first.

Credits: This assignment relies heavily on an assignment developed by Dr. Uppuluri. The sample input above comes from page 403 of The Algorithm Design Manual by Steven S. Skeina and from the Stoney Brook Algorithm Repository.

ITEC 360: Program 3

Due by 11:59:59 p.m., Saturday, May 2, 2020

Late programs not accepted.

Example submit command: submit itec360-01 README.txt editdis.adb anyotherfiles.ads

DRAFT

Example submit command: `submit itec360-01 README.txt editdis.adb anyotherfiles.ads`