CS21 Lab 8: Auto-correct

Due Saturday Night, April 4

Run update21, to create the cs21/labs/08 directory (and some additional files). Then cd into your cs21/labs/08 directory and create the python programs for lab 8 in this directory

Introduction
For this lab, you will write a program called spellchecker.py that asks the user for a file name and then checks that each word in the file is spelled correctly. For words that are misspelled, your program will try to auto-correct by suggesting replacement words.

Here's an example:

$ cat letter.txt 

Dear Sam,

How are you?  I'm fine, but I misss seeing you more ofetn.  
Hope alk is well with your family, espceially your brother. 
I'll be stoping by next Saturddy when I get home from 
Swarthmore.  

Love, 

Alice
$ python spellchecker.py 

Enter name of file to spell check: letter.txt

Unknown words and suggestions:

misss : miss
ofetn : often
alk : all
espceially : especially
stoping : storing
Saturddy : Saturday
Swarthmore : Swarthmore


Getting Started
As a first step, begin by writing a function that can read in the file to spell check and generate a list of the individual words in the file. Your function should remove leading and trailing spaces, newlines and punctuation from each word, but don't worry about removing the apostrophe from contractions like I'm, don't, or I'll. Have your function return a list of words.

For the dictionary of valid words, use the file /usr/share/dict/words that we used in class. Read this file to generate a list of all valid words. Don't worry about removing proper nouns or contractions. Just strip any whitespace at the begin or end.

With the list of words from your text file and the list of dictionary words, generate a list of unknown and possibly misspelled words. For each word in the original text file, you should search for the word using binary search in the list of valid words. Once you have identified the misspelled words, you can move on to the next part of suggesting corrections.

Generating Alternatives

Let's begin by looking at how we generate alternatives for misspelled words. We'll use "seplling" as the running example. This word is not in the dictionary, so we need to propose alternatives. Our alternatives will be generated using three different methods: deletion, transposition, and substitution. Each is explained below:

Deletions

We form alternatives of the word by generating all possible single-letter deletions. Using our example, "seplling", we would generate: eplling, splling, selling, sepling, sepling, sepllng, sepllig, and sepllin. Each alternative differs from the original word by only one letter. For a word of length n, there are n alternates we can form by deleting a single character; here, the word (seplling) was 8 letters long and there are 8 alternatives we can form using deletion.

Transpositions

For transpositions, we swap two adjacent letters in the word. Using our example, "seplling", we would generate: esplling, spelling, selpling, seplling, seplilng, sepllnig and sepllign. For a word of length n, there are n-1 alternates we can form by transposing a pair of characters; here, the word was 8 letters long and there are 7 alternatives we can form using transposition.

Substitutions

We form alternatives of the word by generating all possible single-letter substitutions. For each letter in the original word, we substitute a letter from the alphabet (abcd...xyz) to replace it. There are too many possible substitutions to list them all. Using our example, "seplling", we would generate:

aepllingsapllingsealingsepalingseplaingsepllangseplliagsepllina
bepllingsbpllingseblingsepblingseplbingsepllbngsepllibgsepllinb
cepllingscpllingseclingsepclingseplcingsepllcngsepllicgsepllinc
depllingsdpllingsedlingsepdlingsepldingseplldngsepllidgsepllind
eepllingsepllingseelingsepelingsepleingsepllengseplliegseplline
........................
yepllingsypllingseylingsepylingseplyingsepllyngseplliygseplliny
zepllingszpllingsezlingsepzlingseplzingsepllzngsepllizgsepllinz

Note that many of these alternatives are not actual words. Only a few such as selling via deletion of a p or spelling via transposition of ep are valid words.

You program should contain one or more functions that generate possible alternatives for a given word and check which of the alternatives are valid words.

Ranking alternatives

For the "seplling" example, we find that only 2 of the alternatives are legal words in our dictionary: selling, and spelling. Your program should suggest only one of these two alternatives, by selecting which word is more frequently used.

To help answer this question, we have provided you with a new dictionary (/usr/local/doc/wordcts.txt) which contains each of the words from /usr/share/dict/words augmented with a frequency with which this word occurred in a very large document collection. Here are the first 10 lines of that file:

a 7841087012
aardvark 93030
aardvarks 11762
abaci 5391
aback 272920
abacus 149548
abacuses 4473
abaft 19188
abalone 237386
abalones 6544
Very frequently used words such as "a" occurred very often (7,841,087,012 times). Common words such as "abacus" occurred more frequently (149,548 times) than uncommon words such as "abaci" (5,391 times). We will use the frequency with which words occurred to rank the alternatives and we will choose the most frequently occurring word as the proposed alternative.

In our running example, we want to choose the alternative of "seplling" that is most frequent. Consulting wordcts.txt, we find that "selling" is the most frequent:

selling 29092601
spelling 5373561
Therefore our program should choose selling as the proposed alternative.

To implement this ranking algorithm, you will first need to read the file /usr/local/doc/wordcts.txt into a list of [word, frequency] lists:

[['a', 7841087012], ['aardvark', 93030], ['aardvarks', 11762], ...]
You will then need to search this list of [word, frequency] pairs for each of your proposed alternatives. This means for "seplling" you will need to perform two searches in this list. You will want to use binary search since the list of words is in sorted order. You will then need to choose the word that has the highest frequency of all the proposed alternatives.

Note that each item in this list is a [word, frequency] pair. When searching for selling, you cannot directly compare the word "selling" to the list ["selling", 29092601]. You will need to write a separate binary search function that is very similar to the search used in the first part of this lab for searching /usr/share/dict/words. Since we are only interested in the frequency of the word, you can return the frequency, e.g., 29092601 instead of the position in this search, and return -1 if the word is not found.

Combine this new binary search code with the rest of your program so that for each misspelled word, your program suggests the best alternative word. If there are two or more alternative words with the same highest frequency, you can return any of the matching valid words with this frequency

Note that for some misspelled words, there will be no alternatives that occur in the dictionary. For example, if you type in "Swarthmore", there are no alternatives which occur in the word list. In cases like this, simply return the original word, "Swarthmore".

Requirements
ADDITIONAL EXAMPLES
Helpful tips


Extra Challenges...

These are not part of the lab assignment, but just "fun" extra things you can add to your program. Please don't try these until you have completed the above assignment. And they are just for fun -- no bonus points or extra credit.

If you attempt the extra challenges, please save your changes in spellcheck_extra.py instead of spellcheck.py

$ cp spellcheck.py spellcheck_extra.py
$ vim spellcheck_extra.py


Submit

Once you are satisfied with your program, hand it in by typing handin21 in a terminal window.