I have a MySQL database storing more than 1,000,000 names. The task of my application is a bit unusual: it not only searches for names in the database, but also finds similar names. Suppose the name entered is christian; the application should then show suggested names like christine, chris, etc. What is the optimal way to do this without using the LIKE clause? The suggestions will differ only in the last part of the name.
If you also want similar-sounding names, something like SOUNDEX() could help: http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
Otherwise ... LIKE 'chri%' does not seem like a bad idea to me.
If you really want just the first characters without LIKE, you can use SUBSTRING().
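As a rough illustration of the three options above, the queries might look like this. The table names and its column name are hypothetical, since the question does not show the schema, and 'christian' / 'chri' simply reuse the example from the question.

```sql
-- Assumption: a table names(name VARCHAR(100)) with an index on name.

-- 1. Names whose SOUNDEX code equals that of the input.
SELECT name FROM names WHERE SOUNDEX(name) = SOUNDEX('christian');

-- 2. Prefix match. A LIKE pattern with no leading wildcard
--    can use the index on name.
SELECT name FROM names WHERE name LIKE 'chri%';

-- 3. The same prefix test without LIKE. Wrapping the column in a
--    function generally prevents the index on name from being used.
SELECT name FROM names WHERE SUBSTRING(name, 1, 4) = 'chri';
```

Note that equality on SOUNDEX codes compares whole names, so a short name such as chris may get a different code from christian and be missed. With a million rows, the second form is likely the cheapest of the three because it can be answered from the index.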
You could use PHP's metaphone() function to generate the metaphone code for each name and store the codes along with the names.
<?php
print "chris" . "\t" . metaphone("chris") . "\n";
print "christian" . "\t" . metaphone("christian") . "\n";
print "christine" . "\t" . metaphone("christine") . "\n";
# prints (in the order of the statements above):
# chris	XRS
# christian	XRSXN
# christine	XRSTN
You can then use a Levenshtein distance algorithm (either in PHP [http://php.net/manual/en/function.levenshtein.php] or in MySQL [http://www.artfulsoftware.com/infotree/queries.php#552]) to calculate the distance between the metaphone codes. In my test below, a distance of 2 or less seemed to indicate the level of similarity that you are seeking.
<?php
// Each entry pairs a name with its metaphone code.
$names = array(
    array('mike', metaphone('mike')),
    array('chris', metaphone('chris')),
    array('christian', metaphone('christian')),
    array('christine', metaphone('christine')),
    array('michelle', metaphone('michelle')),
    array('mick', metaphone('mick')),
    array('john', metaphone('john')),
    array('joseph', metaphone('joseph'))
);

// Compare every name against every name in the list.
foreach ($names as $name) {
    _compare($name, $names);
}

// Print the Levenshtein distance between the metaphone code of $n
// and the metaphone code of each entry in $names.
function _compare($n, $names) {
    list($name, $meta) = $n;
    foreach ($names as $cname) {
        printf("The distance between %s and %s is %d\n",
            $name, $cname[0], levenshtein($meta, $cname[1]));
    }
}
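If the comparison should run inside MySQL rather than in PHP, a query along these lines might work. This is only a sketch under assumptions: the table names and its extra column meta (holding the stored metaphone code) are hypothetical, and levenshtein() is a user-defined stored function that has to be installed separately, for example from the MySQL link above, because MySQL does not ship one. The threshold of 2 is the one suggested by the test above.

```sql
-- Assumption: names(name, meta), where meta was filled with PHP's metaphone(name).
-- Assumption: levenshtein(s1, s2) is a user-defined stored function,
--             not a MySQL built-in.
-- 'XRSXN' is the metaphone code of 'christian' from the earlier example.
SELECT name
FROM names
WHERE levenshtein(meta, 'XRSXN') <= 2
ORDER BY levenshtein(meta, 'XRSXN'), name;
```

Because the function has to be evaluated for every row, this is likely to be slow on a million names. Narrowing the candidates first, for example by a shared prefix, would probably be needed in practice.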