This project is read-only.
I was looking for a algorithm implemented in C# that I can fuzzy matching a string. But I did not find a good one so I wrote one myself.

What is the problem?

Assume we have a table (named bookTable) in Windows Azure Table storage. The entity of this table has a property named "bookTitle", it is a string type.
If I want to search this table and find the exactly entity matching the property of a given book title, it is very easy. Just use
from b in bookTable where b.bookTitle == givenTitle select.....
However, in some cases, I do not have exact book title string, but I can remember some possibly words in the bookTitle. I want to search for similar bookTitle from the table, how can I do?

I know this is widely used in search engine. Some good algoritym like "SimHash" has patient by Google. I cannot use for free. There are many implementation of other algoritym in Java or C, I cannot find a easy simple ready to used algorithm written in C# and LINQ.

I did a few research, and wrote this algorithm in C# and LINQ.

How did it work?

My algorithm contains few steps
  1. Separate the input string into groups, every groups contains 2 words. For example. "My name is Kevin" is separated to the following 3 groups
    • My name
    • name is
    • is Kevin
  2. Each groups got from the step above is convert into a 32bit hash value. So the string " My name is Kevin" will become 3 32bit int value.
  3. Do the same for another string which will be compared with this string. We call the two string: strA and strB.
  4. strA and strB are change to 2 Int32[]. I call these 2 Int32[] as SetA and SetB
  5. Calculate a likeness index using (SetA intersect SetB).Count / (SetA union SetB).Count
  6. if the likeness index is 1, then the strA and strB are the same, if the index is 0, then strA and strB is totally different. If the index is 0.8, then the strA and strB has 80% similarity, or 80% likeness.
  7. OrderBy the likeness index descend. I can get the most likely matching string

In my example, I can save the Int32[] hash value array convert to a base64 string, use this string as Partition Key of this Azure Table.

Last edited May 25, 2010 at 4:45 AM by KevinZ, version 2