Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest PHP Routine To Match Words

What is the fastest way in PHP to take a keyword list and match it to a search result (like an array of titles) for all words?

For instance, if my keyword phrase is "great leather shoes", then the following titles would be a match...

  • Get Some Really Great Leather Shoes
  • Leather Shoes Are Great
  • Great Day! Those Are Some Cool Leather Shoes!
  • Shoes, Made of Leather, Can Be Great

...while these would not be a match:

  • Leather Shoes on Sale Today!
  • You'll Love These Leather Shoes Greatly
  • Great Shoes Don't Come Cheap

I imagine there's some trick with array functions or a RegEx (Regular Expression) to achieve this rapidly.

like image 837
Volomike Avatar asked Apr 13 '10 20:04

Volomike


4 Answers

I would use an index for the words in the titles and test if every search term is in that index:

$terms = explode(' ', 'great leather shoes');
$titles = array(
    'Get Some Really Great Leather Shoes',
    'Leather Shoes Are Great',
    'Great Day! Those Are Some Cool Leather Shoes!',
    'Shoes, Made of Leather, Can Be Great'
);
foreach ($titles as $title) {
    // extract words in lowercase and use them as key for the word index
    $wordIndex = array_flip(preg_split('/\P{L}+/u', mb_strtolower($title), -1, PREG_SPLIT_NO_EMPTY));
    // look up if every search term is in the index
    foreach ($terms as $term) {
        if (!isset($wordIndex[$term])) {
            // if one is missing, continue with the outer foreach
            continue 2;
        }
    }
    // echo matched title
    echo "match: $title";
}
like image 81
Gumbo Avatar answered Nov 07 '22 15:11

Gumbo


you can preg_grep() your array against something like

 /^(?=.*?\bgreat)(?=.*?\bleather)(?=.*?\shoes)/

or (probably faster) grep each word separately and then array_intersect the results

like image 34
user187291 Avatar answered Nov 07 '22 16:11

user187291


It might be a pretty naive solution (quite possibly there are more efficient/elegant solutions), but I'ld probably do something like the following:

$keywords = array(
    'great',
    'leather',
    'shoes'
);

$titles = array(
    'Get Some Really Great Leather Shoes',
    'Leather Shoes Are Great',
    'Great Day! Those Are Some Cool Leather Shoes!',
    'Shoes, Made of Leather, Can Be Great',
    'Leather Shoes on Sale Today!',
    'You\'ll Love These Leather Shoes Greatly',
    'Great Shoes Don\'t Come Cheap'
);

$matches = array();
foreach( $titles as $title )
{
  $wordsInTitle = preg_split( '~\b(\W+\b)?~', $title, null, PREG_SPLIT_NO_EMPTY );
  if( array_uintersect( $keywords, $wordsInTitle, 'strcasecmp' ) == $keywords )
  {
    // we have a match
    $matches[] = $title;
  }
}

var_dump( $matches );

No idea how this benchmarks though.

like image 38
Decent Dabbler Avatar answered Nov 07 '22 16:11

Decent Dabbler


You could use

/(?=.*?\great\b)(?=.*?\bshoes\b)(?=.*?\bleather\b)/

Note a couple of things

a)You need word boundaries at both ends else you could end up matching words that contain the ones you are looking for eg "shoes of leather bring greatness".

b)I use lazy wildcard match (i.e .*?). This improves effeciency, as by default * is greedy (i.e. it consumes as many characters as it can match, and only gives them up in favor of a overall match). So if we don't have the trailing ?, .* will match everything in the line and then backtrack to match 'great'. Same procedure is then repeated for 'shoes' and 'leather'. By making * lazy, we avoid these unnecessary backtracks.

like image 26
Jasmeet Avatar answered Nov 07 '22 17:11

Jasmeet