
PHP - Keyword matching in text strings - How to enhance the accuracy of returned keywords?

Tags:

php

I have a piece of PHP code as follows:

$words = array(
    'Art' => '1',
    'Sport' => '2',
    'Big Animals' => '3',
    'World Cup' => '4',
    'David Fincher' => '5',
    'Torrentino' => '6',
    'Shakes' => '7',
    'William Shakespeare' => '8'
    );
$text = "I like artists, and I like sports. Can you call the name of a big animal? Brazil World Cup matchers are very good. William Shakespeare is very famous in the world.";
$all_keywords = $all_keys = array();
foreach ($words as $word => $key) {
    if (strpos(strtolower($text), strtolower($word)) !== false) {
        $all_keywords[] = $word;
        $all_keys[] = $key;
    }
}
echo $keywords_list = implode(',', $all_keywords) . "<br>";
echo $keys_list = implode(',', $all_keys) . "<br>";

The code echoes Art,Sport,World Cup,Shakes,William Shakespeare and 1,2,4,7,8. However, the matching is too simple to return the right keywords. For example, it returns 'Shakes' => '7' because of the word "Shakespeare" in $text, but "Shakes" is not a proper keyword for "Shakespeare". What I want is Art,Sport,World Cup,William Shakespeare and 1,2,4,8 instead of Art,Sport,World Cup,Shakes,William Shakespeare and 1,2,4,7,8. How can I improve the code so that it extracts the keywords without this kind of false match? Thanks for your help.

Sami asked Mar 19 '23 03:03



1 Answer

You may want to look at regular expressions to weed out partial matches:

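// lowercase the keys so that a match found with the /i flag
// (e.g. "world cup") can still be looked up
$lookup = array_change_key_case($words, CASE_LOWER);

// create a regular expression by using alternation
// of all given words
$re = '/\b(?:' . implode('|', array_map(function($keyword) {
    return preg_quote($keyword, '/');
}, array_keys($words))) . ')\b/i';

preg_match_all($re, $text, $matches);
foreach ($matches[0] as $keyword) {
    echo $keyword, " ", $lookup[strtolower($keyword)], "\n";
}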

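The expression uses the \b assertion to match word boundaries, i.e. each keyword must appear as a whole word rather than as part of a longer one. This is why "Shakes" no longer matches inside "Shakespeare"; it also means that "Art" and "Sport" no longer match "artists" and "sports".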
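
To illustrate what \b changes, here is a minimal sketch that compares a plain substring test with a word-boundary match. The helper name matchesWholeWord is made up for this example and is not part of the code above.

```php
<?php
// Hypothetical helper (an assumption for illustration): returns true when
// $keyword occurs in $text as a whole word or phrase, ignoring case.
function matchesWholeWord(string $keyword, string $text): bool
{
    $pattern = '/\b' . preg_quote($keyword, '/') . '\b/i';
    return preg_match($pattern, $text) === 1;
}

$text = "William Shakespeare is very famous in the world.";

// A substring search finds "shakes" inside "shakespeare".
var_dump(stripos($text, 'Shakes') !== false);              // bool(true)

// With \b on both sides, "Shakes" would have to end at a word boundary.
var_dump(matchesWholeWord('Shakes', $text));               // bool(false)
var_dump(matchesWholeWord('William Shakespeare', $text));  // bool(true)

// The same rule rejects "Sport" inside "sports".
var_dump(matchesWholeWord('Sport', 'I like sports.'));     // bool(false)
```

If plural or derived forms such as "sports" should still count, the pattern would need to be loosened deliberately (for example with an optional suffix), which is a separate design decision from removing partial matches.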

Output

World Cup 4
William Shakespeare 8
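
If the comma-separated lists from the question are needed, the same idea could be adapted roughly as follows. This is only a sketch: the $canonical map and the longest-first sort are additions made for this example, not part of the answer's code. The sort is there so that a short keyword cannot pre-empt a longer one starting at the same position (say "World" versus "World Cup").

```php
<?php
$words = array(
    'Art' => '1',
    'Sport' => '2',
    'Big Animals' => '3',
    'World Cup' => '4',
    'David Fincher' => '5',
    'Torrentino' => '6',
    'Shakes' => '7',
    'William Shakespeare' => '8'
);
$text = "I like artists, and I like sports. Can you call the name of a big animal? Brazil World Cup matchers are very good. William Shakespeare is very famous in the world.";

// Assumed helper map: lowercase keyword => keyword as spelled in $words.
$canonical = array();
foreach ($words as $word => $key) {
    $canonical[strtolower($word)] = $word;
}

// Put longer keywords first in the alternation.
$keywords = array_keys($words);
usort($keywords, function ($a, $b) {
    return strlen($b) - strlen($a);
});

$re = '/\b(?:' . implode('|', array_map(function ($keyword) {
    return preg_quote($keyword, '/');
}, $keywords)) . ')\b/i';

preg_match_all($re, $text, $matches);

// Report each matched keyword once, in order of first appearance.
$all_keywords = $all_keys = array();
foreach (array_unique(array_map('strtolower', $matches[0])) as $lower) {
    $word = $canonical[$lower];
    $all_keywords[] = $word;
    $all_keys[] = $words[$word];
}

echo implode(',', $all_keywords), "\n"; // World Cup,William Shakespeare
echo implode(',', $all_keys), "\n";     // 4,8
```

Note that \b only behaves as expected when a keyword starts and ends with a word character; a keyword such as "C++" would need different boundary handling.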
Ja͢ck answered Apr 07 '23 02:04
