Lab 04

Due 11:59pm Wednesday October 3, 2018

This week, you’ll be writing a class from scratch that can find the weighted minimum edit distance between two strings, and can also display the resulting character alignments. It is important that you follow the API below as the code for Lab 05 will assume your class is written as explained below.

When you’re done, you’ll be able to run code like the following:

my_aligner = EditDistanceFinder()
my_aligner.train("/data/spelling/wikipedia_misspellings.txt")

dist, alignments = my_aligner.align("cought","caught")
print("Distance between 'cought' and 'caught' is", dist)
my_aligner.show_alignment(alignments)
print()
dist, alignments = my_aligner.align("caugt","caught")
print("Distance between 'caugt' and 'caught' is", dist)
my_aligner.show_alignment(alignments)

which should generate the output:

Distance between 'cought' and 'caught' is 0.98870
Observed Word: c o u g h t
Intended Word: c a u g h t
Distance between 'caugt' and 'caught' is 0.83541
Observed Word: c a u g % t
Intended Word: c a u g h t

Your EditDistanceFinder class should have one data member: probs, a defaultdict of defaultdicts of floats that maps an intended character to a mapping from observed characters to probabilities.

You are welcome to include other data members if they help with your implementation, but a user of your class should not need to know about them.

Over the course of the assignment, you will implement the following member functions: __init__, ins_cost, del_cost, sub_cost, align, show_alignment, train, train_alignments, and train_costs.

More details about each method are given below, in a suggested order for implementation, but you should feel free to develop in another order if that makes more sense to you.

__init__

In Python classes, __init__ is run when an object is first created. In particular, for your EditDistanceFinder, __init__ should initialize the probs variable to an empty defaultdict of defaultdicts of floats. You can read the documentation for defaultdict if you are unfamiliar with it. You will want to include the line 'from collections import defaultdict' at the top of your program.
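As a concrete sketch, the constructor can be as short as this (the lambda gives each inner dictionary a default value of 0.0 for unseen characters):

```python
from collections import defaultdict

class EditDistanceFinder:
    def __init__(self):
        # probs[intended_char][observed_char] -> probability;
        # any key that has never been set defaults to 0.0
        self.probs = defaultdict(lambda: defaultdict(float))
```

Because both levels are defaultdicts, looking up an unseen pair like probs['h']['%'] silently returns 0.0 instead of raising a KeyError, which is exactly what the untrained cost methods below need.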

ins_cost(observed_char)

This method should take a single character as input, and should return a cost (between 0 and 1) of inserting that character. You should model the cost of inserting character c as 1-p(c), where p(c) is the probability of observing c when nothing was intended, which should be stored as probs['%'][c].

del_cost(intended_char)

This method should take a single character as input, and should return a cost (between 0 and 1) of deleting that character. You should model the cost of deleting character c as 1-p(c), where p(c) is the probability of observing nothing when c was intended, which should be stored as probs[c]['%'].

sub_cost(observed_char, intended_char)

This method should take two characters as input, and should return a cost (between 0 and 1) of replacing the observed character with the intended character. If observed_char == intended_char, it should return 0. Otherwise, you should model the cost of replacing character intended_char with character observed_char as 1-p(observed_char|intended_char), where p(observed_char|intended_char) is the probability that the character observed is observed_char given that the original character was intended_char. This value should be stored as probs[intended_char][observed_char].
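The three cost methods above can be sketched together. This assumes the probs member from __init__; with an untrained model every probability defaults to 0.0, so every cost starts at 1 (or 0 for matching characters):

```python
from collections import defaultdict

class EditDistanceFinder:
    def __init__(self):
        self.probs = defaultdict(lambda: defaultdict(float))

    def ins_cost(self, observed_char):
        # cost of inserting observed_char: 1 - p(observed_char | nothing intended)
        return 1 - self.probs['%'][observed_char]

    def del_cost(self, intended_char):
        # cost of deleting intended_char: 1 - p(nothing observed | intended_char)
        return 1 - self.probs[intended_char]['%']

    def sub_cost(self, observed_char, intended_char):
        # identical characters cost nothing to "substitute"
        if observed_char == intended_char:
            return 0
        return 1 - self.probs[intended_char][observed_char]
```

Before training, these costs make align behave exactly like unweighted Levenshtein edit distance, which is handy for testing.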

align(observed_word, intended_word)

This method is where the heart of the minimum edit distance functionality will live. It’s up to you whether all of the functionality lives entirely in this method, or you decide to break it into one or more helper functions, but users of your class should only need to know about the align method.

This method takes two words as input. It returns a distance (as a float) and the corresponding character alignments (as a list of tuples of characters).

Using the example from above, before training costs, my_aligner.align("caugt","caught") should return (1.0, [('c', 'c'), ('a', 'a'), ('u', 'u'), ('g', 'g'), ('%', 'h'), ('t', 't')]), since the misspelling caugt was the result of the deletion of a single h from the intended string. Note that we will use the percent sign to indicate an empty character, so deleting a c will show up in the alignment as ('%', c) and inserting c will show up in the alignment as (c, '%').

Careful: Deleting the letter 'h' shows up in the alignment as ('%', 'h'), but is stored in the probability matrix as probs['h']['%'] since 'h' was intended and '%' was observed.

I recommend that you use a numpy matrix to store your cost table. You can initialize an M by N matrix of zeros with numpy.zeros((M,N)).

You may want to use a second numpy matrix to store your backtraces, but that design decision is up to you.

This is likely to be the most difficult part of this week’s assignment, and it’s a place where it’s easy to make off-by-one type errors. I recommend that you draw pictures/figures/diagrams to check indices, step through simple examples, and use any other strategies you’re familiar with to build your solution in a structured manner.
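Here is one possible shape for align, not the only valid one: it fills the cost table with dynamic programming, then recovers the alignment by recomputing the local costs while walking back from the bottom-right corner instead of storing a separate backtrace matrix. The minimal constructor and cost methods are repeated so the sketch runs standalone. (With trained, real-valued costs you may prefer to compare costs with a small tolerance rather than ==.)

```python
from collections import defaultdict
import numpy as np

class EditDistanceFinder:
    def __init__(self):
        self.probs = defaultdict(lambda: defaultdict(float))

    def ins_cost(self, observed_char):
        return 1 - self.probs['%'][observed_char]

    def del_cost(self, intended_char):
        return 1 - self.probs[intended_char]['%']

    def sub_cost(self, observed_char, intended_char):
        if observed_char == intended_char:
            return 0
        return 1 - self.probs[intended_char][observed_char]

    def align(self, observed_word, intended_word):
        M, N = len(observed_word), len(intended_word)
        table = np.zeros((M + 1, N + 1))
        # first column: observed chars with nothing intended -> insertions
        for i in range(1, M + 1):
            table[i, 0] = table[i - 1, 0] + self.ins_cost(observed_word[i - 1])
        # first row: intended chars never observed -> deletions
        for j in range(1, N + 1):
            table[0, j] = table[0, j - 1] + self.del_cost(intended_word[j - 1])
        for i in range(1, M + 1):
            for j in range(1, N + 1):
                table[i, j] = min(
                    table[i - 1, j - 1]
                    + self.sub_cost(observed_word[i - 1], intended_word[j - 1]),
                    table[i - 1, j] + self.ins_cost(observed_word[i - 1]),
                    table[i, j - 1] + self.del_cost(intended_word[j - 1]),
                )
        # walk back from the corner, rechecking which move produced each cell
        alignments = []
        i, j = M, N
        while i > 0 or j > 0:
            if (i > 0 and j > 0
                    and table[i, j] == table[i - 1, j - 1]
                    + self.sub_cost(observed_word[i - 1], intended_word[j - 1])):
                alignments.append((observed_word[i - 1], intended_word[j - 1]))
                i, j = i - 1, j - 1
            elif (j > 0 and table[i, j] == table[i, j - 1]
                    + self.del_cost(intended_word[j - 1])):
                alignments.append(('%', intended_word[j - 1]))  # deletion
                j -= 1
            else:
                alignments.append((observed_word[i - 1], '%'))  # insertion
                i -= 1
        alignments.reverse()
        return table[M, N], alignments
```

With the untrained costs, align("caugt", "caught") returns (1.0, [('c', 'c'), ('a', 'a'), ('u', 'u'), ('g', 'g'), ('%', 'h'), ('t', 't')]), matching the example above.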

show_alignment(alignments)

This method should take the alignments returned by align and print them in a friendly way. The first line should contain “Observed Word:” followed by all of the first characters in the alignment, separated by spaces. The second line should contain “Intended Word:” followed by all of the second characters in the alignment, separated by spaces.

Once this is done, I recommend that you pause and write several test cases for your align method, using show_alignment to visually check the result of your alignment algorithm.

train, train_costs and train_alignments

These methods all interact with each other, so it's a bit tricky to decide which one to implement first.

We will start with the train method. The train method should take the name of a file (in our case, /data/spelling/wikipedia_misspellings.txt, which came from this list). Each line of the file contains a common observed misspelling, a comma, and the intended spelling. train should read in the file and split it into a list of tuples, e.g.

[(observed1, intended1), (observed2, intended2), ...]

train will then iteratively call train_alignments and train_costs. But for now, just have it call train_alignments with the list of misspellings that you read in from the file.

Now turn to train_alignments, which should take a list of misspellings like the one you just created. The method should call align on each of the (observed, intended) pairs, and should return a single list with all of the character alignments from all of the pairs.
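The loop in train_alignments is straightforward; here is a sketch, written as a free function taking the aligner so it stands alone (as a method, aligner would simply be self):

```python
def train_alignments(aligner, misspellings):
    # pool the character alignments from every (observed, intended) pair
    alignments = []
    for observed, intended in misspellings:
        _, word_alignments = aligner.align(observed, intended)
        alignments.extend(word_alignments)
    return alignments
```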

Go back to your train method, and save the result of train_alignments to a local variable. Pass that list of alignments to a call to train_costs.

Now turn to train_costs, which takes a list of character alignments and uses it to estimate the likelihood of different types of errors. You will want to count (the class collections.Counter may be useful here) the number of times that each character intended_char is aligned to the character observed_char. To update your probs variable, each of those counts should be normalized by the total number of times that each character was intended, which will ensure that we have valid probability distributions.

Careful: Be sure you make a new self.probs inside train_costs to be certain that any value that should be zero after updating the probabilities doesn’t have some “leftover” value from the previous iteration instead.
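One way to sketch the counting and normalization, shown here as a free function that builds and returns a fresh probability table (as a method, you would assign the result to self.probs, which also avoids leftover values from the previous iteration):

```python
from collections import Counter, defaultdict

def train_costs(alignments):
    # count how often each intended character was observed as each character
    counts = defaultdict(Counter)
    for observed, intended in alignments:
        counts[intended][observed] += 1
    # normalize each intended character's counts into a probability distribution
    probs = defaultdict(lambda: defaultdict(float))
    for intended, observed_counts in counts.items():
        total = sum(observed_counts.values())
        for observed, n in observed_counts.items():
            probs[intended][observed] = n / total
    return probs
```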

Finally, go back to the train method and update it to repeatedly call train_alignments and train_costs until your model converges. You’ll know the model converges when the alignments don’t change from one iteration to another; that shouldn’t take more than 10 iterations for our data.
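The shape of that convergence loop can be sketched as follows (train_until_converged is a hypothetical free-function version for illustration; inside train you would use self and the misspellings you read from the file):

```python
def train_until_converged(aligner, misspellings, max_iters=10):
    # alternate aligning and re-estimating costs until alignments stop changing
    last_alignments = None
    for _ in range(max_iters):
        alignments = aligner.train_alignments(misspellings)
        if alignments == last_alignments:
            break  # converged: identical alignments two iterations in a row
        aligner.train_costs(alignments)
        last_alignments = alignments
    return last_alignments
```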

Test Cases

Once all of your methods are written, update and add to any test cases you wrote along the way to demonstrate the performance of your alignment class.

Questions

In Writeup.md, answer the following questions:

  1. Explore the behavior of the alignments you get for a variety of word lengths and types of errors. Comment on what your model does well, and what it still doesn’t do very well.
  2. Which character(s) have the highest probability of being inserted? Does this surprise you?
  3. Which character(s) have the highest probability of being deleted? Does this surprise you?
  4. Which character(s) have the highest probability of being substituted for something other than itself or '%'? Is there a letter x that stands out? Think of this in both directions, e.g. probs[x][y] and probs[y][x]. Does this surprise you? Why?
  5. One common type of misspelling is related to how close two keys are on the keyboard. By examining your substitution probabilities, what evidence can you find to support or refute that as a source of the errors in our training data?
  6. What limitations does the model you trained have? There are multiple kinds of spelling mistakes that your model can not accurately account for. Which ones? Why? What would you need to add to your system in order to handle those?
  7. In fact, the model you trained vastly overestimates the probability of insertions, deletions, and substitutions. What is it about the data we used to train the model that would result in this overestimate? What might be a more “fair” source of data to estimate the probabilities? What barriers might you run into if you wanted to train in a more justified way?