Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I use kd-trees for determining string similarity?

I am trying to utilize k-nearest neighbors for the string similarity problem i.e. given a string and a knowledge base, I want to output k strings that are similar to my given string. Are there any tutorials that explain how to utilize kd-trees to efficiently do this k-nearest neighbor lookup for strings? The string length will not exceed more than 20 characters.

like image 662
Legend Avatar asked Apr 17 '11 22:04

Legend


People also ask

How do you measure string similarity?

The way to check the similarity between any data point or groups is by calculating the distance between those data points. In textual data as well, we check the similarity between the strings by calculating the distance between one text to another text.

How does kd tree work?

A K-D Tree(also called as K-Dimensional Tree) is a binary search tree where data in each node is a K-Dimensional point in space. In short, it is a space partitioning(details below) data structure for organizing points in a K-Dimensional space.

What is string similarity search?

String similarity search is a fundamental query that has been widely used for DNA sequencing, error-tolerant query autocompletion, and data cleaning needed in database, data warehouse, and data mining.

Does Knn use kd tree?

In this paper, the KD-tree will be used for spatial indexing in a new KNN-based spatial clustering algorithm. A k-d tree, or k-dimensional tree, is a data structure used for organizing some number of points in a space with k dimensions.

Is an octree a KD tree?

Note that octrees are not the same as k-d trees: k-d trees split along a dimension and octrees split around a point. Also k-d trees are always binary, which is not the case for octrees. By using a depth-first search the nodes are to be traversed and only required surfaces are to be viewed.


1 Answers

Probably one of the hottest blog posts I had read a year or so ago: Levenstein Automata. Take a look at that article. It provides not only a description of the algorithm but also code to follow. Technically, it's not a kd-tree but it's quite related to the string matching and dictionary correction algorithms one might encounter/use in the real world.

He also has another blog post about BK-trees which are much better at the fuzzy matching for strings and string look ups where there are mispellings. Here is another resource containing source code for a BK-tree (this one I can't verify the accuracy or proper implementation.)

like image 116
wheaties Avatar answered Oct 04 '22 07:10

wheaties