Monday, January 19, 2009

Searching for Similar Words. Similarity Metric



How one can find out if two or more words are similar? I do not mean semantically similar (synonyms aren't taken into consideration), but visually similar.

Consider, these two words, "sample1" and "sample2". Do they look similar? Well, at least they have the same start - "sample". The next two have merely common letters in them: "fox" and "wolf".

One of the methods that can be used to measure words similarity is Euclidian distance. It is used to measure distance beetwen two point in space.

The code to measure similarity of 2 strings:

public static double Euclidian(string v1Orig, string v2Orig)
{
string v1, v2;
if (v1Orig.Length > v2Orig.Length)
{
v1 = v1Orig; v2 = v2Orig;
}
else
{
v1 = v2Orig; v2 = v1Orig;
}

double d = 0.0;
for (int i = 0; i < v1.Length; i++)
{
if ( i < v2.Length )
d += Math.Pow((v1[i] - v2[i]), 2);
else
d += Math.Pow((v1[i] - 0.0), 2);
}
return Math.Sqrt(d);
}

Using the code above we can get numbers that measure words similarity:
"sample1" and "sample2" will give 1.0. While "wolf" and "fox" give 104.10091258005379. Words that are identical will give 0.0. Thus the less number you get the more two words are similar.

In the context of Euclidean distance, "fox" and "wolf" have greater distance then "sample1" and "sample2".

This measurement approach can be used when searching for word groups in the text.