Implement spelling correction using Language Models

Spelling correction is not a trivial task for a computer. Increasingly capable models have been invented to tackle it, and language models are among the models used for this task. Language models are also used for correcting errors in speech recognition and machine translation, for language and authorship identification, text compression, and topic relevance ranking. In this article, a language model is used to build a simple spelling correction application.

There are many ways to tackle spelling errors. One of the simplest ideas is to check each word against a dictionary, but several problems arise with this approach. For example, if I made an error and wrote “ra”, did I mean “rat” or “read” originally? That depends on the context: in the sentence “I saw a ra running”, I probably meant “I saw a rat running” (and not “I saw a read running”). Furthermore, storing the dictionary takes a vast amount of space, and the approach cannot handle new words that are not yet in the dictionary. Another way of handling spelling errors is to convert a sentence to its Soundex codes and then convert it back to a sentence. Soundex (Knuth 1973, Davidson 1962) reduces words to their sound-related features: the code is created by discarding vowels and grouping similar-sounding consonants. The word “bicycle” is converted to the code “B224” in the Soundex system, and the misspelling “bycicle” is converted to “B224” as well, so even with spelling errors we get the same code. If we have converted all words to their corresponding Soundex codes, we can try to reconstruct the original sentence without spelling errors.
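To make this concrete, here is a minimal Soundex sketch in Python. It is my own simplified illustration, not code from the original article or from a library; standard Soundex also has special rules for “h” and “w” that are omitted here.

def soundex(word):
    """A simplified Soundex: keep the first letter, map the remaining
    consonants to digit groups, drop vowels, and collapse repeats."""
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
              "l": "4", "mn": "5", "r": "6"}

    def digit(char):
        for letters, code in groups.items():
            if char in letters:
                return code
        return ""  # vowels (and h, w, y) carry no code

    word = word.lower()
    code = word[0].upper()
    previous = digit(word[0])
    for char in word[1:]:
        current = digit(char)
        if current and current != previous:
            code += current
        previous = current
    return (code + "000")[:4]  # pad or truncate to four characters

print(soundex("bicycle"))  # B224
print(soundex("bycicle"))  # B224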

Language Model?

Now what is a language model? A language model is a simplified statistical model of text. It is a data-driven approach, in contrast to the many rule-based approaches that exist. Rule-based approaches are built on expert knowledge, and the problem is that human languages are effectively infinite. Every word has many variants, and to make things worse, words can be combined to form new words. When a new brand pops up, new words are introduced, so language changes over time. That means that a static set of rules can never capture all aspects of a language. Therefore, we need another approach to model language: a statistical language model.

A Topic Model

Suppose we have a language model $latex M_{\mbox{English}}$. Then we are able to compare how well-formed different sentences are. Suppose we have the sentence “The bycicle was green.” You can see the spelling error “bycicle” here; the correct sentence would be “The bicycle was green.” So, in our statistical model, we would say that it is more probable that “The bicycle was green.” was generated by the English language model than “The bycicle was green.” In mathematical notation: $latex P(\mbox{The bicycle was green.} \mid M_{\mbox{English}}) > P(\mbox{The bycicle was green.} \mid M_{\mbox{English}})$.
Notice that we are also capable of finding the right topic for a text. Suppose we have two models, $latex M_{\mbox{sports}}$ and $latex M_{\mbox{food}}$, and the sentence “Hot coffee.” Then we know that $latex P(\mbox{Hot coffee.} \mid M_{\mbox{food}}) > P(\mbox{Hot coffee.} \mid M_{\mbox{sports}})$. In words: it is more probable that “Hot coffee.” was generated by a language model about food than by a language model about sports.
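As a toy illustration of this comparison, a unigram model could score the two hypotheses as a product of per-word probabilities. The models and probability values below are made up for demonstration, not trained on real data.

# Toy unigram "topic" models; the word probabilities are made-up
# illustrative numbers, not trained values.
food_model = {"hot": 0.30, "coffee": 0.20}
sports_model = {"hot": 0.05, "coffee": 0.01}

def sentence_probability(sentence, model, unknown=1e-6):
    """P(sentence | M) as the product of per-word unigram probabilities;
    unseen words get a small fallback probability."""
    probability = 1.0
    for word in sentence.lower().split():
        probability *= model.get(word, unknown)
    return probability

p_food = sentence_probability("Hot coffee", food_model)
p_sports = sentence_probability("Hot coffee", sports_model)
print(p_food > p_sports)  # True: the food model fits the sentence better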

By the way, maybe this article is interesting for you: it is about the implementation of an e-mail spam detector and is closely related to language modeling. Or maybe you are interested in a more advanced model, in which we use a sequence-to-sequence model.

The Spelling Correction Problem

Now we can turn to our spelling correction problem. First, we need a sentence with a spelling error in it. Take the sentence “buy chocolat chip cookies” as an example; our goal is to find that “chocolat” contains a spelling error. Suppose we have a language model $latex M$ that we would like to train and then use on our example sentence. For that, we first need some correctly spelled training sentences. For simplicity, we will use the short list of sentences defined as training_sentences in the code below.

A Simple Unigram Model

from collections import defaultdict
import math


def get_words(sentence):
    """
    Get a list of all words which are found in a given sentence.

    Parameters
    ----------
    sentence : str
        The sentence to find words in.

    Returns
    -------
    list
        A list containing all words found in the sentence.
    """
    return sentence.split()


def calculate_distance(word1, word2):
    """
    Calculate the distance between two words. This method could be replaced by a method calculating the Levenshtein
    distance.

    Parameters
    ----------
    word1 : str
        First word.
    word2 : str
        Second word.

    Returns
    -------
    int
        Distance between word1 and word2.
    """
    return len(word1.replace(word2, ''))


# Define the training and test data
training_sentences = ["cookies", "chocolate", "chip", "chocolate chip cookie", "chocolate chip cookies", "buy"]
sentence = "buy chocolat chip cookies"

# Find all words in the data
training_data = [get_words(sentence) for sentence in training_sentences]
words = get_words(sentence)

# Show the sentence with spelling errors
print("Incorrect sentence:\t{0}\n".format(sentence))

# Count unigrams
counts = defaultdict(int)
for sentence_data in training_data:
    for unigram in sentence_data:
        counts[unigram] += 1

# Normalize the counts so they form a probability distribution
totals = sum(counts.values())
for unigram in counts:
    counts[unigram] /= totals

# Now check every word in the test sentence
correct_unigrams = []
for word in words:
    if word not in counts:
        # The word did not occur in the training data!
        print("Spelling error detected:\t{0}".format(word))
        # Now find a good replacement
        minimum_distance = math.inf
        best_replacement = None
        for vocab_word in counts:
            distance = calculate_distance(vocab_word, word)
            # Make sure the vocabulary word is not the same as the spelling error and make sure it is better than the
            # result found so far
            if 0 < distance < minimum_distance:
                minimum_distance = distance
                best_replacement = vocab_word
        # Show the correction
        print("Best found correct word:\t{0} (distance: {1})\n".format(best_replacement, minimum_distance))
        correct_unigrams.append(best_replacement)
    else:
        correct_unigrams.append(word)

print("Corrected sentence:\t{0}".format(' '.join(correct_unigrams)))


Conclusions (TL;DR)

By using a statistical approach with a language model, we were able to find and correct a spelling error. A rule-based approach could also do this, but it would not work for relatively new words. Many extensions to this statistical approach are possible, but notice that it is a powerful approach which can easily be adapted to solve related problems (such as finding the topic of a given sentence). If you have any questions or additions, feel free to write a comment.