
Optimal way to find similar value from a large table

Tags:

sql

mysql

I have a MySQL database storing more than 1,000,000 names. My application's task is a bit unusual: it not only searches for names in the database but also finds similar names. Suppose the name entered is christian; the application should then suggest names like christine, chris, etc. What is the optimal way to do this without using the LIKE clause? The suggestions should differ from the entered name only in its last part.

user794091 asked Jun 11 '11 16:06



2 Answers

If you also want names that sound similar, something like SOUNDEX() could help: http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
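To get a feel for what a soundex key groups together, here is a minimal sketch using PHP's built-in soundex(). The helper name `group_by_soundex` is hypothetical, and this is only an approximation of the database side: MySQL's SOUNDEX() can return strings longer than four characters, so its keys may not be identical to PHP's.

```php
<?php
// Hypothetical helper: bucket names by their PHP soundex() key.
// Names that land in the same bucket are treated as sounding alike.
function group_by_soundex(array $names) {
    $groups = array();
    foreach ($names as $name) {
        $groups[soundex($name)][] = $name;
    }
    return $groups;
}

$groups = group_by_soundex(array('christian', 'christine', 'chris', 'john'));
print_r($groups);
```

In this sample, christian and christine share a key while chris gets a different one, which suggests soundex alone may be too coarse for some pairs and too strict for others.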

Otherwise, LIKE 'chri%' doesn't seem like a bad idea to me.

If you really want to match just the first characters without LIKE, you can use SUBSTRING().
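As a rough sketch of the prefix approach, the snippet below builds a LIKE pattern from the first few characters of the entered name. The helper `like_prefix_pattern`, the default prefix length of 4, and the table `names` with a `name` column are all assumptions for illustration. A pattern with only a trailing wildcard can generally use an ordinary index on the column, whereas wrapping the column in SUBSTRING() usually cannot.

```php
<?php
// Hypothetical helper: build a LIKE pattern matching names that start with
// the first $length bytes of $name (substr() is byte-based, so multibyte
// names would need mb_substr() instead).
function like_prefix_pattern($name, $length = 4) {
    $prefix = substr($name, 0, $length);
    // Escape the escape character first, then the LIKE wildcards,
    // so that user input is matched literally.
    $escaped = str_replace(
        array('\\', '%', '_'),
        array('\\\\', '\\%', '\\_'),
        $prefix
    );
    return $escaped . '%';
}

// Assumed schema: table `names` with an indexed `name` column.
// The pattern would be bound to the placeholder in a prepared statement.
$sql = 'SELECT name FROM names WHERE name LIKE ? LIMIT 20';
$pattern = like_prefix_pattern('christian');

echo $sql, "\n", $pattern, "\n";
```

Binding the pattern as a parameter rather than concatenating it into the SQL keeps the query safe against injection.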

flori answered Sep 21 '22 01:09


You could use PHP's metaphone() function to generate the metaphone code for each name and store the codes alongside the names.

<?php
print "chris" . "\t" . metaphone("chris") . "\n";
print "christian" . "\t" . metaphone("christian") . "\n";
print "christine" . "\t" . metaphone("christine") . "\n";

# prints:
# chris      XRS
# christian  XRSXN
# christine  XRSTN

You can then use a Levenshtein distance algorithm (either in PHP [http://php.net/manual/en/function.levenshtein.php] or MySQL [http://www.artfulsoftware.com/infotree/queries.php#552]) to calculate the distance between the metaphone codes. In my test below, a distance of 2 or less seemed to indicate the level of similarity that you are seeking.

<?php
// Pair each name with the metaphone code of that same spelling.
$names = array();
foreach (array('mike', 'chris', 'christian', 'christine',
               'michelle', 'mick', 'john', 'joseph') as $name) {
        $names[] = array($name, metaphone($name));
}

// Print the distance from every name to every name in the list.
foreach ($names as $name) {
        _compare($name, $names);
}

// Compare one (name, metaphone code) pair against the whole list.
function _compare($n, $names) {
        list($name, $meta) = $n;

        foreach ($names as $cname) {
                printf("The distance between %s and %s is %d\n",
                  $name, $cname[0], levenshtein($meta, $cname[1]));
        }
}
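Building on the same idea, here is a minimal sketch of how the stored codes might be turned into a suggestion list. The function `suggest_names` and its `$maxDistance` parameter are hypothetical names, and the in-memory array stands in for codes that would normally be read from the database; the threshold of 2 is simply the value suggested above, not a tuned setting.

```php
<?php
// Hypothetical helper: return the names whose stored metaphone code is
// within $maxDistance edits of the query's code, closest first.
// $codes maps name => precomputed metaphone code.
function suggest_names($query, array $codes, $maxDistance = 2) {
    $queryCode = metaphone($query);
    $matches = array();
    foreach ($codes as $name => $code) {
        if ($name === $query) {
            continue; // do not suggest the entered name itself
        }
        $distance = levenshtein($queryCode, $code);
        if ($distance <= $maxDistance) {
            $matches[$name] = $distance;
        }
    }
    asort($matches); // sort by distance, keeping name keys
    return $matches;
}

// Stand-in for codes stored alongside the names in the database.
$codes = array();
foreach (array('mike', 'chris', 'christian', 'christine', 'john', 'joseph') as $name) {
    $codes[$name] = metaphone($name);
}

print_r(suggest_names('christian', $codes));
```

Scanning every stored code in PHP would likely be too slow for a million rows, so in practice the candidate set would probably need to be narrowed first, for example by a shared prefix of the name or of the code.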
spuriousdata answered Sep 19 '22 01:09