Spelling correction is not a trivial task for a computer. Better and better models are invented to tackle problems such as spelling correction. Language models are the kind of models that are being used for this task. Language models are also used for correcting errors in speech recognition, machine translation, for language and authorship identification, text compression and topic relevance ranking. In this article, language models are being used for a simple spelling correction application.
There are many ways to tackle spelling errors. One of the simplest ideas is to check each word against a dictionary. There are some problems that can occur here. For example, if I made an error in the word “ra”, did I mean “rat” or “read” originally? It depends on the context. In the sentence “I saw a ra running”, I probably mean “I saw a rat running” (and not “I saw a read running”). Furthermore, it would take a vast amount of space to store the dictionary and besides that it does not take into account new words that are not stored in the dictionary. Another way of handling spelling errors, is to convert a sentence to its Soundex and then convert it back to a sentence. The Soundex was invented by (Knuth 1973, Davidson 1962) and it convert words to only their sound related features. The word “bicycle” is converted code “B224” in the Soundex system. Also “bycicle” and “bicycle” are converted to “B224”, so even with spelling errors we get the same code. Creating the code is done by discarding vowels and grouping consonant letters. If we have converted all words to their corresponding Soundex code, we can try to reconstruct the original sentence without spelling errors.
Language Model?
Now what is a language model? A language model is a simplified statistical model of text. It is a data driven approach instead of the many rule based approaches that exist. Rule based approaches are based on expert knowledge. The problem here is that languages (at least the human language) are approximately infinite. Every word has at least many variants and to make things worse, words can be combined to form new words. If a new brand pops up, then new words are introduced. So language is changing over time. That means that a static set of rules can never define all aspects of a language. Therefore, we need another approach to model language. This can be done by a statistical language model.
A Topic Model
Suppose we have a language model $latex M_{\mbox{English}}$. Then we are able to compare how well-formed a sentence is. Suppose we have a sentence “The bycicle was green.” You can see the spelling error “bycicle” here. The correct sentence would be “The bicycle was green.” So, in our statistical model, we would say that it is more probable that “The bicycle was green.” was generated by the English language model than “The bycicle was green.” In mathematics, we would say: $latex P(\mbox{The bicycle was green.} | M_{\mbox{English}}) > P(\mbox{The bycicle was green.} | M_{\mbox{English}})$. |
Notice that we are also capable of finding the right topic for a text. Suppose we have two models: $latex M_{\mbox{sports}}$ and $latex M_{\mbox{food}}$. Suppose we have a sentence “Hot coffee.” Then, we now that $latex P(\mbox{Hot coffee.} | M_{\mbox{food}} )$ $latex > P(\mbox{Hot coffee.} | M_{\mbox{sports}} )$. In words: it is more probable that “Hot coffee.” is generated by a language model about food than that it was generated by a language model about sports. |
By the way, maybe this article is interesting for you. It is about the implementation of a e-mail spam detector and is closely related to language modeling. Or maybe you are interested in a more advanced model in which we will use a sequence-to-sequence model.
The Spelling Correction problem
Now we can turn to our spelling correction problem. First, we need a sentence which has a spelling error in it. Take the sentence “buy chocolat chip cookies” as an example. Our goal is to find that “chocolat” has a spelling error. Suppose we have a language model $latex M$ which we would like to train and use for our example sentence. So we first need some correctly spelled sentences. For simplicity we will use the following correctly spelled sentences:
1 |
|
A Simple Unigram Model
Conclusions (TL;DR)So by using a statistical approach with a language model, we were able to find and correct a spelling error. By using a rule based approach, this would also be possible, but it would not work for relatively new words. There are many extensions possible to this statistical approach, but notice that it is a powerful approach which can be easily transformed to give solutions to related problems (such as finding a topic for a given sentence). If you have any questions or additions, feel free to write a comment.