if anyone has ever submitted a story to digg, it checks whether or not the story is already submitted, I assume by a fuzzy search.
I would like to implement something similar and want to know if they are using a php class that is open source?
Soundex isnt doing it, sentences/strings can be up to 250chars in length
Fuzzy string matching can help improve data quality and accuracy by data deduplication, identification of false-positives etc.
Fuzzy matching (FM), also known as fuzzy logic, approximate string matching, fuzzy name matching, or fuzzy string matching is an artificial intelligence and machine learning technology that identifies similar, but not identical elements in data table sets.
Fuzzy searches help you find relevant results even when the search terms are misspelled. To perform a fuzzy search, append a tilde (~) at the end of the search term. For example the search term bank~ will return rows that contain tank , benk or banks .
You can use the T-SQL algorithm to perform fuzzy matching, comparing two strings and returning a score between 1 and 0 (with 1 being an exact match). With this method, you can use fuzzy logic for address matching, which helps you account for partial matches.
Unfortunately, doing this in PHP is prohibitively expensive (high CPU and memory utilization.) However, you can certainly apply the algorithm to small data sets.
To specifically expand on how you can create a server meltdown: couple of built-in PHP functions will determine "distance" between strings: levenshtein and similar_text.
Dummy data: (pretend they're news headlines)
$titles = <<< EOF Apple Apples Orange Oranges Banana EOF;$titles = explode("\n", $titles );
At this point, $titles should just be an array of strings. Now, create a matrix and compare each headline against EVERY other headline for similarity. In other words, for 5 headlines, you will get a 5 x 5 matrix (25 entries.) That's where the CPU and memory sink goes in.
That's why this method (via PHP) can't be applied to thousands of entries. But if you wanted to:
$matches = array(); foreach( $titles as $title ) { $matches[$title] = array(); foreach( $titles as $compare_to ) { $matches[$title][$compare_to] = levenshtein( $compare_to, $title ); } asort( $matches[$title], SORT_NUMERIC ); }
At this point what you basically have is a matrix with "text distances." In concept (not in real data) it looks sort of like this table below. Note how there is a set of 0 values that go diagonally - that means that in the matching loop, two identical words are -- well, identical.
Apple Apples Orange Oranges Banana Apple 0 1 5 6 6 Apples 1 0 6 5 6 Orange 5 6 0 1 5 Oranges 6 5 1 0 5 Banana 6 6 5 5 0
The actual $matches array looks sort of like this (truncated):
Array ( [Apple] => Array ( [Apple] => 0 [Apples] => 1 [Orange] => 5 [Banana] => 6 [Oranges] => 6 ) [Apples] => Array ( ...
Anyhow, it's up to you to (by experimentation) determine what a good numerical distance cutoff might mostly match - and then apply it. Otherwise, read up on sphinx-search and use it - since it does have PHP libraries.
Orange you glad you asked about this?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With