I am having problems with a PHP function that optimizes a search string for a MySQL query.
I need to find an entry which looks like 'hobbit, the' by searching for 'the hobbit'.
I thought about cutting the articles (in German we have 'der', 'die' and 'das') out of the search string if they are followed by a space.
My function looks like:
public function optimizeSearchString($searchString)
{
    $articles = [
        'der ',
        'die ',
        'das ',
        'the '
    ];
    foreach ($articles as $article) {
        //only cut $article out of $searchString if its longer than the $article itself
        if (strlen($searchString) > strlen($article) && strpos($searchString, $article)) {
            $searchString = str_replace($article, '', $searchString);
            break;
        }
    }
    return $searchString;
}
But this doesn't work...
Maybe there is a nicer solution using regular expressions?
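One likely reason the function above does nothing for 'the hobbit': strpos() returns the integer offset of the match, and an article at the very start of the string sits at offset 0, which PHP treats as false in an if condition. The following is only a minimal sketch of that pitfall; containsArticle() is a hypothetical helper name, not part of the original code.

```php
<?php
// Hypothetical helper, not part of the original code: reports whether
// $article occurs in $searchString, including at position 0.
function containsArticle(string $searchString, string $article): bool
{
    // strpos() returns the int offset of the match, or false if there is none.
    // Offset 0 is a valid match, so compare strictly against false.
    return strpos($searchString, $article) !== false;
}

var_dump(strpos('the hobbit', 'the '));          // int(0)
var_dump((bool) strpos('the hobbit', 'the '));   // bool(false): the loose check misses it
var_dump(containsArticle('the hobbit', 'the ')); // bool(true)
```

Note that str_replace() in the original is also case-sensitive and replaces every occurrence, so 'The hobbit' would still pass through unchanged.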
1.) To remove just one stopword from the start or end of the string, you can use a regex like this:
~^\W*(der|die|das|the)\W+\b|\b\W+(?1)\W*$~i
~ is the pattern delimiter
^ the caret anchor matches the start of the string
\W (uppercase) is shorthand for a character that is not a word character
(der|die|das|the) alternation | in the first parenthesized group
\b matches a word boundary
(?1) the pattern of the first group is pasted
$ matches right after the last character in the string
i (PCRE_CASELESS) flag. If the input is UTF-8, the u (PCRE_UTF8) flag is also needed.
Reference - What does this regex mean
Generate the pattern:
// array containing stopwords
$stopwords = array("der", "die", "das", "the");
// escape the stopword array and implode with pipe
$s = '~^\W*('.implode("|", array_map("preg_quote", $stopwords)).')\W+\b|\b\W+(?1)\W*$~i';
// replace with emptystring
$searchString = preg_replace($s, "", $searchString);
Note that if the ~ delimiter occurs in the $stopwords array, it also has to be escaped with a backslash.
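As a sketch of how the delimiter could be escaped automatically: preg_quote() accepts the delimiter as an optional second argument and escapes it along with the regex metacharacters. The function name buildEdgeStopwordPattern() is made up for this illustration; the pattern itself is the one shown above.

```php
<?php
// Hypothetical helper name; the stopword list is the one from the answer.
function buildEdgeStopwordPattern(array $stopwords): string
{
    // Passing '~' as the second argument makes preg_quote() escape the delimiter too.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '~');
    }, $stopwords);

    return '~^\W*(' . implode('|', $quoted) . ')\W+\b|\b\W+(?1)\W*$~i';
}

$pattern = buildEdgeStopwordPattern(['der', 'die', 'das', 'the']);

echo preg_replace($pattern, '', 'the hobbit'), "\n";   // hobbit
echo preg_replace($pattern, '', 'Hobbit, the'), "\n";  // Hobbit
echo preg_replace($pattern, '', 'the'), "\n";          // the (a lone stopword is kept)
```

A closure is used instead of array_map("preg_quote", ...) because array_map passes only one argument per element, so the delimiter would otherwise never reach preg_quote().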
Regex pattern at regex101
2.) To remove stopwords anywhere in the string, how about splitting it into words:
// words to be removed
$stopwords = array(
    'der' => 1,
    'die' => 1,
    'das' => 1,
    'the' => 1);
# used words as key for better performance
// remove stopwords from string
function strip_stopwords($str = "")
{
    global $stopwords;
    // 1.) break string into words
    // [^-\w\'] matches characters, that are not [0-9a-zA-Z_-']
    // if input is unicode/utf-8, the u flag is needed: /pattern/u
    $words = preg_split('/[^-\w\']+/', $str, -1, PREG_SPLIT_NO_EMPTY);
    // 2.) if we have at least 2 words, remove stopwords
    if(count($words) > 1)
    {
        $words = array_filter($words, function ($w) use (&$stopwords) {
            return !isset($stopwords[strtolower($w)]);
            # if utf-8: mb_strtolower($w, "utf-8")
        });
    }
    // check if not too much was removed such as "the the" would return empty
    if(!empty($words))
        return implode(" ", $words);
    return $str;
}
See ideone.com
// test it
echo strip_stopwords("The Hobbit das foo, der");
Hobbit foo
This solution will also remove any punctuation besides _, - and ' because it's imploding the remaining words with a space after removing the common words. The idea is to prepare the string for a query.
Neither solution modifies the case, and both leave the string untouched if it consists only of a single stopword.
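For UTF-8 input, the code comments above suggest the u flag and mb_strtolower(). A minimal sketch of how that could look follows; the function name stripStopwordsUtf8(), the stopword list passed as a parameter instead of a global, and the fallback to strtolower() when the mbstring extension is missing are all assumptions made for this illustration.

```php
<?php
// Hypothetical UTF-8 aware variant; the function name and the parameter
// for the stopword list are assumptions, not part of the answer above.
function stripStopwordsUtf8(string $str, array $stopwords): string
{
    // Use mb_strtolower() when the mbstring extension is loaded, else fall back.
    $lower = function ($w) {
        return function_exists('mb_strtolower') ? mb_strtolower($w, 'UTF-8') : strtolower($w);
    };

    // Build a lookup table keyed by lowercased stopword.
    $lookup = array_flip(array_map($lower, $stopwords));

    // The u flag makes the pattern treat the input as UTF-8.
    $words = preg_split('/[^-\w\']+/u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return $str; // e.g. invalid UTF-8 input: leave the string untouched
    }

    // Only filter when there are at least two words, as in the original.
    if (count($words) > 1) {
        $words = array_filter($words, function ($w) use ($lookup, $lower) {
            return !isset($lookup[$lower($w)]);
        });
    }

    // If everything was removed (e.g. "the the"), return the input unchanged.
    return empty($words) ? $str : implode(' ', $words);
}

echo stripStopwordsUtf8('Die Räuber', ['der', 'die', 'das', 'the']), "\n"; // Räuber
```

Without the u flag, the bytes of a multibyte character such as 'ä' would not count as word characters, so 'Räuber' would be split in the middle.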
Lists of common words
The solution provided by @Jonny 5 seems to be the best fit for my case.
Now I use a function like this:
public function optimizeSearchString($searchString = "")
{
    $stopwords = array(
        'der' => 1,
        'die' => 1,
        'das' => 1,
        'the' => 1);
    $words = preg_split('/[^-\w\']+/', $searchString, -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) > 1) {
        $words = array_filter($words, function ($v) use (&$stopwords) {
            return !isset($stopwords[strtolower($v)]);
        }
        );
    }
    if (empty($words)) {
        return $searchString;
    }
    return implode(" ", $words);
}
The new solution by Jonny 5 would also work, but I use this one because I'm not that familiar with regex and I know what's going on :-)