## Ratcliff/Obershelp String Matching

One of the most interesting things in my collection is the DrDobbs' Journal Archive DVD. It contains all the DrDobbs articles all the way back to 1988.

One thing I like to do is browse through this archive and seeing how the world of computers changed since 1988. In the 80's FORTH and Ada were languages that were widely used; The 90's saw the rise of C++ and then Java and the coming of the Internet; In the 00's you'll see the dotcom crash followed by the coming of Google and .Net.

(Reading the editorials are also very interesting. One particular article caught my attention recently is titled "Future History" by Michael Swaine in the April 1997 issue. It was written shortly after Apple announced that they would buy NeXT instead of BeOS as the whole industry was led to believe. It provides some interesting reading about the condition Apple found itself in at the time, Steve Jobs' return to Apple and the technological edge provided by NeXT.)

But I digress...

One of the articles from 1988 that piqued my interest is called "Pattern Matching: the Gestalt Approach" by John W. Ratcliff and David E. Metzener. It describes an algorithm called Ratcliff/Obershelp (which I'll abbreviate R/O) that compares how similar two strings are.

You would use such an algorithm in cases where your program may get a string input from the user and you want to validate it. The example in the original article was an educational system where users can enter answers (instead of using multiple choice questions, as computerized education systems tend to do). The R/O algorithm can compare the correct answer against the user's answer and a decision can be made on whether the user misspelt the answer or just answered incorrectly.

The article also had an example of a compiler that keeps on compiling even though the user made a typo with a variable name or keyword. In case of an error, the R/O can be used to find the most likely keyword and try to continue compiling. This way it can find more compiler errors in a single run (the article was written in a time when compilation of programs was slow), and suggest corrections to the programmer.

I also thought about this algorithm when a colleague of mine played around with soundex to implement a spell checker a while back. In my opinion, the R/O method is a better fit for spell checking applications; soundex is too specific to English and the result of a soundex is difficult to use for a spell checker. I will get to R/O's usage in a spell checker a bit further down.

The algorithm compares two strings by finding the longest submatch between them. It then uses this submatch as an anchor. It then does the same recursively for the strings to the left and to the right of the anchor. Recursion stops when there are no more anchors to be found. The result is 2 times the sum of the lengths of all the anchors divided by the length of the two strings being compared. My implementation converts the result to an integer percentage, so that 100% means the strings are exact matches and 0% means that the two strings have nothing in common.

This article also describes the algorithm and it has a nice table that illustrates its operation. It is focussed on .Net and T-SQL, but the first half of the article gives you more information on what I'm discussing here.

Eric S. Raymond (ESR) once suggested Ratcliff/Obershelp be added to the Python standard libraries. It didn't get added to the Python libraries, but the ensuing discussion was interesting in typical ESR fashion. You can read the thread here on the Python mailing lists. I like the this quote of his to describe why Ratcliff/Obershelp is useful from the thread:

...you take your unknown input, check it against a dictionary and kick "maybe you meant foo?" to the user for every foo with an R/O similarity above 0.6 or so. The effects look like black magic. Users love it.

The algorithm in the original DrDobbs article was written in assembly. In the November 1988 issue they published two readers' C implementations of this algorithm - look for the heading "Real Programmers Don't Use Assembly Language". If you've got access to that column, you'll notice that my implementation is based on the one by Joe Preston, who introduced his version by stating that "like all real programmers, I dislike having to use assembly language."

You can find the source code of my implementation in simil.c as part of my miscsrc package.

I should also mention that a modern system could also take advantage of Bayesian inference when checking for spelling errors of the kind this algorithm checks for. Here is the article How to Write a Spelling Corrector by Peter Norvig that I referenced when writing this section (FYI Peter Norvig is the director of research at Google and presenter of the Artificial Intelligence course I did last year):

Suppose a user may write "thet" when in fact he meant "that", you can express the probability that a user meant "that" when he typed "thet" like so:

$P(\text{“that”}|\text{“thet”})$

You can then apply Bayes' rule as follows:

$P(\text{“that”}|\text{“thet”}) = \frac{P(\text{“thet”}|\text{“that”}) \times P(\text{“that”})}{P(\text{“thet”})}$

which you can express in English as follows: "The probability that a user meant "that" when he typed "thet" is equal to the probability that he would've misspelled "that" as "thet" (the P("thet"|"that") term) multiplied by the probability that he would've written "thet" (the P("thet") term) and divided by the probability that he would've written "that" (the P("that") term)."

The P("thet") and P("that") terms can be determined from statistics. Suppose you've got an essay of ten thousand English words and of those thousand searches 20 were "that" and only one was for "thet", then P("that") is 20/10000 while P("thet") is only 1/10000. If you read Peter Norvig's article you'll see that you don't actually need the P("thet") for this particular problem.

The P("thet"|"that") is the probability that the user would accidently misspell "that" as "thet". You can uses statistics for this, but Norvig used the edit distance between two words as an approximation.

You can use the Ratcliff/Obershelp method to estimate P("thet"|"that"), and that is the point I'm trying to get to. simil("thet","that") will return 75, so P("thet"|"that") can be approximated as 0.75.

When you implement this, you'll have a list of english words, and calculate P(w|"thet") for every word w in the list, and choose the one for which the probability is the greatest. If your list only has "the" and "that" and you find that $$P(\text{“thet”}|\text{“that”}) > P(\text{“thet”}|\text{“the”})$$ it implies that the user probably meant "that" and typed "thet" by accident.

Well, I've digressed a lot, but I think I said what I set out to say. Take care until next time.