Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

php (fuzzy) search matching

if anyone has ever submitted a story to digg, it checks whether or not the story is already submitted, I assume by a fuzzy search.

I would like to implement something similar and want to know if they are using a php class that is open source?

Soundex isnt doing it, sentences/strings can be up to 250chars in length

like image 672
chris Avatar asked Jul 08 '10 23:07

chris


People also ask

Is fuzzy matching good?

Fuzzy string matching can help improve data quality and accuracy by data deduplication, identification of false-positives etc.

What is fuzzy data matching?

Fuzzy matching (FM), also known as fuzzy logic, approximate string matching, fuzzy name matching, or fuzzy string matching is an artificial intelligence and machine learning technology that identifies similar, but not identical elements in data table sets.

What is fuzzy search example?

Fuzzy searches help you find relevant results even when the search terms are misspelled. To perform a fuzzy search, append a tilde (~) at the end of the search term. For example the search term bank~ will return rows that contain tank , benk or banks .

What is fuzzy matching in SQL?

You can use the T-SQL algorithm to perform fuzzy matching, comparing two strings and returning a score between 1 and 0 (with 1 being an exact match). With this method, you can use fuzzy logic for address matching, which helps you account for partial matches.


1 Answers

Unfortunately, doing this in PHP is prohibitively expensive (high CPU and memory utilization.) However, you can certainly apply the algorithm to small data sets.

To specifically expand on how you can create a server meltdown: couple of built-in PHP functions will determine "distance" between strings: levenshtein and similar_text.

Dummy data: (pretend they're news headlines)

$titles = <<< EOF
Apple
Apples
Orange
Oranges
Banana
EOF;

$titles = explode("\n", $titles );

At this point, $titles should just be an array of strings. Now, create a matrix and compare each headline against EVERY other headline for similarity. In other words, for 5 headlines, you will get a 5 x 5 matrix (25 entries.) That's where the CPU and memory sink goes in.

That's why this method (via PHP) can't be applied to thousands of entries. But if you wanted to:

$matches = array();
foreach( $titles as $title ) {
    $matches[$title] = array();
    foreach( $titles as $compare_to ) {
        $matches[$title][$compare_to] = levenshtein( $compare_to, $title );
    }
    asort( $matches[$title], SORT_NUMERIC  );
}

At this point what you basically have is a matrix with "text distances." In concept (not in real data) it looks sort of like this table below. Note how there is a set of 0 values that go diagonally - that means that in the matching loop, two identical words are -- well, identical.

       Apple Apples Orange Oranges Banana
Apple    0     1      5      6       6
Apples   1     0      6      5       6
Orange   5     6      0      1       5
Oranges  6     5      1      0       5
Banana   6     6      5      5       0

The actual $matches array looks sort of like this (truncated):

Array
(
    [Apple] => Array
        (
            [Apple] => 0
            [Apples] => 1
            [Orange] => 5
            [Banana] => 6
            [Oranges] => 6
        )

    [Apples] => Array
        (
      ...

Anyhow, it's up to you to (by experimentation) determine what a good numerical distance cutoff might mostly match - and then apply it. Otherwise, read up on sphinx-search and use it - since it does have PHP libraries.

Orange you glad you asked about this?

like image 146
pp19dd Avatar answered Oct 23 '22 02:10

pp19dd