CS21: Lab 8

Run update21, to create the cs21/labs/08 directory (and some additional files). Then cd into your cs21/labs/08 directory and create the python programs for lab 8 in this directory

Introduction

For this lab, you will write a program called spellchecker.py that asks the user for a file name and then checks that each word in the file is spelled correctly. For words that are misspelled, your program will try to auto-correct by suggesting replacement words.

Here's an example:

$ cat letter.txt 

Dear Sam,

How are you?  I'm fine, but I misss seeing you more ofetn.  
Hope alk is well with your family, espceially your brother. 
I'll be stoping by next Saturddy when I get home from 
Swarthmore.  

Love, 

Alice
$ python spellchecker.py 

Enter name of file to spell check: letter.txt

Unknown words and suggestions:

misss : miss
ofetn : often
alk : all
espceially : especially
stoping : storing
Saturddy : Saturday
Swarthmore : Swarthmore

Getting Started

As a first step, begin by writing a function that can read in the file to spell check and generate a list of the individual words in the file. Your function should remove leading and trailing spaces, newlines and punctuation from each word, but don't worry about removing the apostrophe from contractions like I'm, don't, or I'll. Have your function return a list of words.

For the dictionary of valid words, use the file /usr/share/dict/words that we used in class. Read this file to generate a list of all valid words. Don't worry about removing proper nouns or contractions. Just strip any whitespace at the begin or end.

With the list of words from your text file and the list of dictionary words, generate a list of unknown and possibly misspelled words. For each word in the original text file, you should search for the word using binary search in the list of valid words. Once you have identified the misspelled words, you can move on to the next part of suggesting corrections.

Generating Alternatives

Let's begin by looking at how we generate alternatives for misspelled words. We'll use "seplling" as the running example. This word is not in the dictionary, so we need to propose alternatives. Our alternatives will be generated using three different methods: deletion, transposition, and substitution. Each is explained below:

Deletions

We form alternatives of the word by generating all possible single-letter deletions. Using our example, "seplling", we would generate: eplling, splling, selling, sepling, sepling, sepllng, sepllig, and sepllin. Each alternative differs from the original word by only one letter. For a word of length n, there are n alternates we can form by deleting a single character; here, the word (seplling) was 8 letters long and there are 8 alternatives we can form using deletion.

Transpositions

For transpositions, we swap two adjacent letters in the word. Using our example, "seplling", we would generate: esplling, spelling, selpling, seplling, seplilng, sepllnig and sepllign. For a word of length n, there are n-1 alternates we can form by transposing a pair of characters; here, the word was 8 letters long and there are 7 alternatives we can form using transposition.

Substitutions

We form alternatives of the word by generating all possible single-letter substitutions. For each letter in the original word, we substitute a letter from the alphabet (abcd...xyz) to replace it. There are too many possible substitutions to list them all. Using our example, "seplling", we would generate:

aeplling	saplling	sealing	sepaling	seplaing	sepllang	seplliag	sepllina
beplling	sbplling	sebling	sepbling	seplbing	sepllbng	sepllibg	sepllinb
ceplling	scplling	secling	sepcling	seplcing	sepllcng	sepllicg	sepllinc
deplling	sdplling	sedling	sepdling	seplding	seplldng	sepllidg	sepllind
eeplling	seplling	seeling	sepeling	sepleing	seplleng	sepllieg	seplline
...	...	...	...	...	...	...	...
yeplling	syplling	seyling	sepyling	seplying	sepllyng	seplliyg	seplliny
zeplling	szplling	sezling	sepzling	seplzing	sepllzng	sepllizg	sepllinz

Note that many of these alternatives are not actual words. Only a few such as selling via deletion of a p or spelling via transposition of ep are valid words.

You program should contain one or more functions that generate possible alternatives for a given word and check which of the alternatives are valid words.

Ranking alternatives

For the "seplling" example, we find that only 2 of the alternatives are legal words in our dictionary: selling, and spelling. Your program should suggest only one of these two alternatives, by selecting which word is more frequently used.

To help answer this question, we have provided you with a new dictionary (/usr/local/doc/wordcts.txt) which contains each of the words from /usr/share/dict/words augmented with a frequency with which this word occurred in a very large document collection. Here are the first 10 lines of that file:

a 7841087012
aardvark 93030
aardvarks 11762
abaci 5391
aback 272920
abacus 149548
abacuses 4473
abaft 19188
abalone 237386
abalones 6544

Very frequently used words such as "a" occurred very often (7,841,087,012 times). Common words such as "abacus" occurred more frequently (149,548 times) than uncommon words such as "abaci" (5,391 times). We will use the frequency with which words occurred to rank the alternatives and we will choose the most frequently occurring word as the proposed alternative.

In our running example, we want to choose the alternative of "seplling" that is most frequent. Consulting wordcts.txt, we find that "selling" is the most frequent:

selling 29092601
spelling 5373561

Therefore our program should choose selling as the proposed alternative.

To implement this ranking algorithm, you will first need to read the file /usr/local/doc/wordcts.txt into a list of [word, frequency] lists:

[['a', 7841087012], ['aardvark', 93030], ['aardvarks', 11762], ...]

You will then need to search this list of [word, frequency] pairs for each of your proposed alternatives. This means for "seplling" you will need to perform two searches in this list. You will want to use binary search since the list of words is in sorted order. You will then need to choose the word that has the highest frequency of all the proposed alternatives.

Note that each item in this list is a [word, frequency] pair. When searching for selling, you cannot directly compare the word "selling" to the list ["selling", 29092601]. You will need to write a separate binary search function that is very similar to the search used in the first part of this lab for searching /usr/share/dict/words. Since we are only interested in the frequency of the word, you can return the frequency, e.g., 29092601 instead of the position in this search, and return -1 if the word is not found.

Combine this new binary search code with the rest of your program so that for each misspelled word, your program suggests the best alternative word. If there are two or more alternative words with the same highest frequency, you can return any of the matching valid words with this frequency

Note that for some misspelled words, there will be no alternatives that occur in the dictionary. For example, if you type in "Swarthmore", there are no alternatives which occur in the word list. In cases like this, simply return the original word, "Swarthmore".

Requirements

use the str split() method to split lines from the user's file into individual words
use the str strip() method to strip off spaces, newlines, and punctuation from words
use the system dictionary /usr/share/dict/words to determine if a word is spelled correctly
use binary search when looking up words in the dictionary.
some words, like Swarthmore, are spelled correctly but not listed in the system dictionary.
handle both lowercase and capitalized words correctly (ex: 'How' and 'how')
make good use of functions in your code
Only read in the files /usr/share/dict/words and /usr/local/doc/wordcts.txt once and generate the appropriate lists. Do not re-read these files for each search you perform.
run your program on the other .txt files in your directory (see ADDITIONAL EXAMPLES below)
make sure all sample text files (gbaddress.txt, chapter1.txt) work as shown in our ADDITIONAL EXAMPLES page

ADDITIONAL EXAMPLES

Helpful tips

You can get a list of all the punctuation from the string library

from string import punctuation
punx = punctuation
print punx

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

To handle capitalization, first check if the word is in the dictionary without changing the word. This will find words like "Sam", "I'm" or "I'll", which are in the dictionary with capitalization. If a search indicates the word is not found, try again with a lower case version of the word using word.lower(). If neither search finds the word, consider it misspelled or unknown.

Extra Challenges...

These are not part of the lab assignment, but just "fun" extra things you can add to your program. Please don't try these until you have completed the above assignment. And they are just for fun -- no bonus points or extra credit.

If you attempt the extra challenges, please save your changes in spellcheck_extra.py instead of spellcheck.py

$ cp spellcheck.py spellcheck_extra.py
$ vim spellcheck_extra.py

one-time: change your program so that it only lists a misspelled word once, rather than each time it is misspelled in the document
insertions: generate more possible alternative words by inserting a single lowercase letter into each possible position. For example, the typo "ot" could be expanded to 'cot', 'dot', 'got', 'hot', 'jot', 'lot', 'not', 'pot', 'rot', 'sot', 'tot', 'wot', 'oat', 'oft', 'opt', or 'out'.
bigger example: try your spell checker on a larger file: /usr/local/doc/alice.txt

Submit

Once you are satisfied with your program, hand it in by typing handin21 in a terminal window.

CS21 Lab 8: Auto-correct

Deletions

Transpositions

Substitutions