Note about a bug fix
IMPORTANT NOTE: The output in interaction.py is no longer valid. Please see the bottom of the webpage for what I hope is the correct output!
Introduction
In this week’s lab, you will build on last week’s edit distance finding code to implement a spell-checker that a) generates suggested spelling corrections and b) automatically fixes spelling errors.
Answers to written questions should be added to a file called Writeup.md in your repository.
EditDistanceFinder
This week’s starter code includes an EditDistance.py file that is the same as the one you wrote last week, but with a couple of additions:
- There’s a prob method that returns the log likelihood of one string being converted to another.
- Laplace smoothing has been added.
- An argparse interface has been added.
Questions
- In Writeup.md, explain how Laplace smoothing works in general and how it is implemented in the EditDistance.py file. Why is Laplace smoothing needed in order to make the prob method work? In other words, the prob method wouldn’t work properly without smoothing – why?
- Describe the command-line interface for EditDistance.py. What command should you run to generate a model from /data/spelling/wikipedia_misspellings.txt and save it to ed.pkl?
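As a generic illustration of add-alpha (Laplace) smoothing, separate from however EditDistance.py implements it, here is a minimal sketch. The helper name smoothed_log_prob and the counts are made up for this example and are not part of the starter code.

```python
import math
from collections import Counter


def smoothed_log_prob(counts, event, vocabulary, alpha=1.0):
    """Add-alpha (Laplace) estimate of log P(event) over a fixed vocabulary.

    Hypothetical helper for illustration only; not the starter code's API.
    """
    total = sum(counts[item] for item in vocabulary)
    return math.log((counts[event] + alpha) / (total + alpha * len(vocabulary)))


# Made-up counts: "c" is in the vocabulary but was never observed.
counts = Counter({"a": 3, "b": 1})
vocabulary = ["a", "b", "c"]

log_probs = {item: smoothed_log_prob(counts, item, vocabulary) for item in vocabulary}
for item, log_prob in log_probs.items():
    print(item, log_prob)
```

Every item in the vocabulary gets a finite log probability here, including the unseen one, and the smoothed probabilities still sum to 1.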
LanguageModel
This lab’s starter code also includes a file called LanguageModel.py that defines an n-gram language model. Read through the code for the LanguageModel class, then answer the following questions:
- What n-gram orders are supported by the given LanguageModel class?
- How does the given LanguageModel class deal with the problem of 0-counts?
- What behavior does the “__contains__()” method of the LanguageModel class provide?
- Spacy uses a lot of memory if it tries to load a very large document. To avoid that problem, LanguageModel limits the amount of text that’s processed at once with the get_chunks method. Explain how that method works.
- Describe the command-line interface for LanguageModel.py. What command should you run to generate a model from /data/gutenberg/*.txt and save it to lm.pkl if you want an alpha value of 0.1 and a vocabulary size of 40000?
The language model takes a bit of time to train – on the order of 20 minutes or so depending on what machine you use. You may want to start training the LanguageModel in another window before you continue reading the lab writeup.
Required Part (Everyone Does the Same Thing)
Your job for this week will be to write a SpellChecker that uses the EditDistanceFinder class as the error (channel) model and the provided LanguageModel as the language model to implement spelling correction.
You will be using spacy again in this lab, making use of the built-in part-of-speech tagger and parser. To initialize spacy for the lab, use the line below. You will probably want the nlp variable to be an instance variable in your class.
nlp = spacy.load("en", pipeline=["tagger", "parser"])
Your class should have the following member functions:
- __init__(channel_model=None, language_model=None, max_distance=2), which should take an EditDistanceFinder, a LanguageModel, and an int as input, and should initialize your SpellChecker.
- load_channel_model(fp), which should take a file pointer as input, and should initialize the SpellChecker object’s channel_model data member to a default EditDistanceFinder and then load the stored channel model (e.g. ed.pkl) from fp into that data member.
- load_language_model(fp), which should take a file pointer as input, and should initialize the SpellChecker object’s language_model data member to a default LanguageModel and then load the stored language model (e.g. lm.pkl) from fp into that data member.
- bigram_score(prev_word, focus_word, next_word), which should take three words as input (a “previous” word, a “focus” word, and a “next” word), and should return the average of the bigram scores of the bigrams (prev_word, focus_word) and (focus_word, next_word) according to the LanguageModel.
- unigram_score(word), which should take a word as input, and should return the unigram probability of the word according to the LanguageModel.
- cm_score(error_word, corrected_word) (“channel model score”), which should take an error word and a possible correction as input, and should return the EditDistanceFinder’s probability of the corrected word having been transformed into the error word. Be careful about the order of the arguments that you pass to the EditDistanceFinder: because of how you’ve trained the probability model, P(error_word|corrected_word) may not equal P(corrected_word|error_word).
- inserts(word), which should take a word as input and return a list of potential words that are within one insert of word.
- deletes(word), which should take a word as input and return a list of potential words that are within one deletion of word.
- substitutions(word), which should take a word as input and return a list of potential words that are within one substitution of word.
- generate_candidates(word), which should take a word as input and return a list of candidate words (that are in the LanguageModel) that are within self.max_distance edits of word by calling inserts, deletes, and substitutions. To find all words that are edit distance 1 away, just call inserts, deletes, and substitutions and concatenate those results together. To generate candidate words that are distance 2 away, first generate all the candidates that are 1 away. Then, generate all the 1-edit-distance-away candidates for each of those. Continue in this fashion for distance 3, etc.
- check_sentence(sentence, fallback=False), which should take a list of words as input and return a list of lists. Each sublist in the return value corresponds to a single word in the input sentence. Words in the sentence that are in the language model will be represented as a sublist containing just that word. Words in the sentence that are not in the language model will be represented as a sublist of possible corrections. This sublist should be the result of calling generate_candidates on the non-word and then sorting the candidates by the combination of LanguageModel score and EditDistance score. If no candidates are found and fallback is True, then non-words should be represented by a sublist with just the original word (the same representation as correctly-spelled words).
- check_text(text, fallback=False), which should take a string as input, tokenize and sentence segment it with spacy, and then return the concatenation of the result of calling check_sentence on all of the resulting sentence objects.
- autocorrect_sentence(sentence), which should take a tokenized sentence (as a list of words) as input, call check_sentence on the sentence with fallback=True, and return a new list of tokens where each non-word has been replaced by its most likely spelling correction.
- autocorrect_line(line), which should take a string as input, tokenize and segment it with spacy, and then return the concatenation of the result of calling autocorrect_sentence on all of the resulting sentence objects.
- suggest_sentence(sentence, max_suggestions), which should take a tokenized sentence (as a list of words) as input, call check_sentence on the sentence, and return a new list where:
    - Real words are just strings in the list
    - Non-words are lists of up to max_suggestions suggested spellings, ordered by your model’s preference for them.
- suggest_text(text, max_suggestions), which should take a string as input, tokenize and segment it with spacy, and then return the concatenation of the result of calling suggest_sentence on all of the resulting sentence objects.
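The candidate-generation steps described above can be sketched as standalone functions. This is only one possible shape, not the required implementation: it uses a plain Python set (toy_vocabulary, made up here) in place of the LanguageModel, and it filters against the vocabulary only at the end so that non-word intermediate strings can still lead to real words at distance 2.

```python
import string

ALPHABET = string.ascii_lowercase


def inserts(word):
    """Strings formed by inserting one lowercase letter anywhere in word."""
    return [word[:i] + c + word[i:] for i in range(len(word) + 1) for c in ALPHABET]


def deletes(word):
    """Strings formed by deleting one character from word."""
    return [word[:i] + word[i + 1:] for i in range(len(word))]


def substitutions(word):
    """Strings formed by replacing one character of word with a different letter."""
    return [
        word[:i] + c + word[i + 1:]
        for i in range(len(word))
        for c in ALPHABET
        if c != word[i]
    ]


def generate_candidates(word, vocabulary, max_distance=2):
    """Vocabulary words reachable from word in at most max_distance edits.

    vocabulary is a stand-in (assumption) for membership in the LanguageModel.
    """
    seen = {word}
    frontier = {word}
    for _ in range(max_distance):
        next_frontier = set()
        for current in frontier:
            for edited in inserts(current) + deletes(current) + substitutions(current):
                if edited not in seen:
                    seen.add(edited)
                    next_frontier.add(edited)
        frontier = next_frontier
    # Filter only at the end: intermediate strings need not be real words.
    return sorted(w for w in seen if w in vocabulary)


# Toy vocabulary, made up for this example.
toy_vocabulary = {"be", "by", "the", "they", "means", "men"}
print(generate_candidates("yb", toy_vocabulary))     # ['be', 'by']
print(generate_candidates("menas", toy_vocabulary))  # ['means', 'men']
```

Tracking already-seen strings in a set keeps the distance-2 expansion from revisiting the same string many times. The candidates come back in alphabetical order here; ranking them by model score is a separate step.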
Hints and additional information about some of these functions follow:
- For checking bigram probabilities of the first or last word in a sentence, you’ll want to make use of ‘<s>’ (start of sentence token) and ‘</s>’ (end of sentence token); the language model is trained to know what they are.
- For ranking suggestions, I suggest:
    - Using an evenly-weighted linear combination (.5, .5) of the unigram and bigram probabilities for the language model
    - Evenly weighting the language model and channel model (since we’re in log space, that means just taking their sum).
- When you are generating candidate corrections, you may find the constant string.ascii_lowercase helpful.
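The ranking hints above can be sketched as follows. The function names and the log-probability values are hypothetical; in your SpellChecker the three numbers would come from unigram_score, bigram_score, and cm_score rather than from a hand-written dictionary.

```python
def context(sentence, index):
    """Previous and next words for sentence[index], padded with <s> and </s>."""
    padded = ["<s>"] + list(sentence) + ["</s>"]
    # sentence[index] sits at padded[index + 1].
    return padded[index], padded[index + 2]


def combined_score(unigram_logprob, bigram_logprob, channel_logprob):
    """Evenly weighted language model score plus the channel model score."""
    language_model = 0.5 * unigram_logprob + 0.5 * bigram_logprob
    # In log space, evenly weighting the two models is just a sum.
    return language_model + channel_logprob


def rank_candidates(scores):
    """Sort candidates from most to least preferred.

    scores maps candidate -> (unigram, bigram, channel) log probabilities.
    """
    return sorted(scores, key=lambda word: combined_score(*scores[word]), reverse=True)


# Made-up log probabilities, for illustration only.
toy_scores = {
    "by": (-5.0, -6.0, -8.5),
    "be": (-4.0, -3.0, -9.0),
}
print(context(["they", "did", "not"], 0))
print(rank_candidates(toy_scores))
```

With these made-up numbers, "be" scores -12.5 and "by" scores -14.0, so "be" is ranked first even though its channel score is slightly worse.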
Sample Interaction
The file interaction.py gives a sample interaction with the SpellChecker class. If you call interaction.py from the command line with language and edit distance models created above, it should use them to check (and optionally autocorrect) sentences.
Evaluation
In /data/spelling/ there are two files:
- reddit_comments.txt, which is an aggressively-filtered set of comments from Reddit, based on this Kaggle set
- reddit_ispell.txt, which is the output we got by autocorrecting the comment file with ispell
For a variety of reasons, labeled corpora of spelling errors are hard to come by. You can perform a noisy evaluation of your system by comparing it to the ispell output.
The file autocorrect.py will use your spell checker, language model, and edit distance class to auto-correct every sentence in every line that is passed to it. Use your SpellChecker to autocorrect the reddit_comments.txt file, then use the diff tool to compare the output. Based on a hand analysis of a reasonable subset of differences, answer the following questions:
- How often did your spell checker do a better job of correcting than ispell? Conversely, how often did ispell do a better job than your spell checker?
- Can you characterize the type of errors your spell checker tended to do best at, and the type of errors ispell tended to do best at?
- Comment on anything else you notice that is interesting about spell checking – either for your model or for ispell.
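The diff tool compares the two outputs line by line. For the hand analysis, a token-level view can make individual disagreements easier to spot. The sketch below is one way to get that view, assuming the two outputs are line-aligned and can be split on whitespace; the function name differing_tokens and the sample lines are made up for illustration.

```python
def differing_tokens(our_lines, ispell_lines):
    """Collect (line_number, ours, theirs) wherever the two outputs disagree.

    Assumes both outputs have the same lines in the same order. If a pair of
    lines has different token counts, the whole line pair is reported.
    """
    differences = []
    for line_number, (ours, theirs) in enumerate(zip(our_lines, ispell_lines), start=1):
        our_tokens, their_tokens = ours.split(), theirs.split()
        if our_tokens == their_tokens:
            continue
        if len(our_tokens) != len(their_tokens):
            differences.append((line_number, ours.strip(), theirs.strip()))
            continue
        for our_token, their_token in zip(our_tokens, their_tokens):
            if our_token != their_token:
                differences.append((line_number, our_token, their_token))
    return differences


# Made-up sample lines standing in for the two corrected files.
our_output = ["they did not be any men", "hello world"]
ispell_output = ["they did not by any means", "hello world"]
for difference in differing_tokens(our_output, ispell_output):
    print(difference)
```

In practice the two systems may tokenize punctuation differently, in which case many lines will fall into the whole-line branch and the regular diff output is the better starting point.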
Optional Part (Pick One or More)
Once you have your spell checker working to correct non-words, you should add one of the following:
Phonetic Suggestions
Expand your generate_candidates to also suggest words whose pronunciation is within an edit distance of self.max_distance of each error word. Your solution should use the metaphone code that is included with the lab. In Writeup.md, you should:
- Describe your approach
- Give examples of how your approach works, including specific sentences where your new model gives a different (hopefully better!) result than the baseline model.
- Discuss any challenges you ran into, design decisions you made, etc.
Real-Word Correction
Add a new member function to your SpellChecker class called check_words() that generates suggested corrections for real word spelling errors. Your check_spelling() function should call check_words after check_sentence_words, so functions like autocorrect_sentence and suggest_sentence should work off of the combination of the two.
You should feel free to use the simplifying assumption of at most one real-word spelling error in a sentence if it makes your task easier.
In Writeup.md, you should:
- Describe your approach
- Give examples of how your approach works, including specific sentences where your new model gives a different (hopefully better!) result than the baseline model.
- Discuss any challenges you ran into, design decisions you made, etc.
Transpositions
Extend your model to handle character transpositions, where two characters are “swapped,” resulting in spelling errors like “teh.”
In Writeup.md, you should:
- Describe your approach
- Give examples of how your approach works, including specific sentences where your new model gives a different (hopefully better!) result than the baseline model.
- Discuss any challenges you ran into, design decisions you made, etc.
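As a starting point for the transposition extension, here is a minimal sketch of a helper that generates all adjacent-character swaps of a word. The name transposes is hypothetical and not part of the starter code. The sketch covers candidate generation only, so how the channel model should score a transposition is left open.

```python
def transposes(word):
    """Strings formed by swapping one pair of adjacent characters in word."""
    return [
        word[:i] + word[i + 1] + word[i] + word[i + 2:]
        for i in range(len(word) - 1)
        if word[i] != word[i + 1]  # swapping identical letters changes nothing
    ]


print(transposes("teh"))  # ['eth', 'the']
```

A word of length n has at most n - 1 transpositions, so adding them to the distance-1 candidates is cheap compared with inserts and substitutions.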
Other Extensions
With instructor approval, you are encouraged to come up with other ways to expand your spell checker. Some ideas:
- Add one or more features that would make your spell checker work with another language.
- Change the error model or language model underlying your spell checking system. For example, how could vector semantics be included?
- Explore the bias inherent in spell checkers. Find and report on research related to whose language is represented in spell checkers, and how the way spell checkers are implemented might unequally impact different people.
- Add a way for your system to learn when new words should be added to your dictionary.
Some good places to start looking for relevant research:
In Writeup.md, you should:
- Describe your approach
- Give examples of how your approach works, including specific sentences where your new model gives a different (hopefully better!) result than the baseline model.
- Discuss any challenges you ran into, design decisions you made, etc.
Bug fix
The original interaction.py file contained incorrect sample output. Below is the correct sample output.
>>> print(s.channel_model.prob("hello", "hello"))
-0.6520393913851943
>>> print(s.channel_model.prob("hellp", "hello"))
-10.655417526736118
>>> print(s.channel_model.prob("hllp", "hello"))
-12.889127454847866
>>> print(s.check_text("they did not yb any menas"))
[[['they'], ['did'], ['not'], ['be', 'by', 'my', 'you', 'i', 'in', 'ye', 'b', 'y', 'rib', 'yet', 
'ay', 'if', 'job', 'ob', 'yo', 'jib', 'of', 'ly', 'on', 'ab', 'o', 'rob', 'orb', 'jub', 'it', 
'ty', 'bo', 'is', 'a', 'yea', 'mob', 'cab', 'web', 'sob', 'to', 'up', 'yon', 'yew', 'yes', 'cob', 
'an', 'obi', 'ebb', 'nob', 'do', 'iv', 'alb', 'bab', 'eye', 'tob', 'yaw', 'v', 'abi', 'mab', 'at',
'he', 'go', 'as', 'x', 'rub', 'gob', 'lye', 'sub', 'or', 'ix', 'aye', 'd', 'lbs', 'cub', 'pub', 
'tub', 'z', 'so', 'dab', 'bob', 'we', 'l', 'dye', 'k', 'pmb', 'n', 'xv', 'ho', 'hye', 'il', 'yer',
'wo', 'yee', 'ex', 'bye', 'yis', 'vp', 'ox', 'rye', 'oh', 'w', 'io', 'en', 'm', 'ed', 'h', 'me',
'am', 'xx', 'el', 'us', 'no', 'fye', 'eh', 't', 'qu', 'ii', 'r', 'e', 'c', 'ah', 'ha', 's', 'lo',
'al', 'uz', 'em', 'ad', 'ao', 'ow', 'og', 'vs', 'er', 'ir', 'et', 'mr', 'un', 'hm', 'th', 'ji',
'ai', 'xi', 'je', 'hi', 'ze', 'co', 'wm', 'ee', 'au', 'ou', 'ar', 'ca', 'um', 'ro', 'vi', 'de',
'dr', 'fa', 'va', 'sh', 'la', 'nt', 'tm', 'ma', 'gr', 'ur', 'di', 're', 'st', 'tu', 'da', 'ms', 
'le', 'pi', 'si', 'se'], ['any'], ['men', 'means', 'mens', 'meals', 'mes', 'mans', 'meanes', 
'meats', 'meat', 'menials', 'omens', 'mean', 'mene', 'mines', 'enos', 'menace', 'mend', 'meads', 
'zenas', 'kenaz', 'menan', 'seas', 'ment', 'jonas', 'mess', 'mead', 'medes', 'medals', 'enan', 
'monks', 'minus', 'ends', 'mews', 'fens', 'minds', 'dens', 'meal', 'midas', 'eras', 'amends', 
'pens', 'hena', 'hens', 'tens', 'vedas', 'meres', 'mental', 'lens', 'peas', 'lena', 'meah', 
'medad', 'venus', 'arenas', 'aeneas', 'metals', 'enam', 'medan', 'demas', 'teas', 'zenan', 
'kenan', 'meets', 'sends', 'merab', 'texas', 'tents', 'bends', 'melts', 'metal', 'tends', 'penal', 
'dents', 'lends', 'cents', 'rents', 'annas']]]
>>> print(s.autocorrect_line("they did not yb any menas"))
[['they'], ['did'], ['not'], ['be'], ['any'], ['men']]
>>> print(s.suggest_text("they did not yb any menas", max_suggestions=2))
[['they'], ['did'], ['not'], ['be', 'by'], ['any'], ['men', 'means']]
In addition, you may find this to be helpful:
>>> text = """This should take a list of words as input and return a list of lists. 
	Each sublist in the return value corresponds to a single word in the input 
	sentence. Words in the sentence that are in the language model will be represented
	as a sublist containing just that word. Words in the sentence that are not in the
	language model will be represented as a sublist of possible corrections. This sublist
	of possible corrections should be, for each word in the sentence not in the language
	model, the result of calling generate_candidates with each of the candidates in the
	list and then sorting these candidates by the combination of LanguageModel score and
	EditDistance score. If no candidates are found and fallback is True, then non-words
	should be represented by a sublist with just the original word (the same 
	representation as correctly-spelled words).""".lower()
>>> result = sp.autocorrect_line(text)
>>> print(' '.join([x[0] for x in result]))
this should take a list of words as put and return a list of lists . each subtlest in the 
return value corresponds to a single word in the put sentence . words in the sentence that 
are in the language model will be represented as a subtlest containing just that word . 
words in the sentence that are not in the language model will be represented as a subtlest 
of possible corrections . this subtlest of possible corrections should be , for each word 
in the sentence not in the language model , the result of calling generate_candidates with 
each of the candidates in the list and then sorting these candidates by the combination 
of languagemodel score and editdistance score . if no candidates are found and fallacy is 
true , then non - words should be represented by a subtlest with just the original word 
( the same representation as correctly - spilled words ) .