
parse search string for phrases and keywords

I need to parse a search string for keywords and phrases in PHP. For example:

string 1: value of "measured response" detect goal "method valuation" study

will yield: value,of,measured response,detect,goal,method valuation,study

I also need it to work if the string has:

  1. no phrases enclosed in quotes,
  2. any number of phrases enclosed in quotes with any number of keywords outside the quotes,
  3. only phrases in quotes,
  4. only space-separated keywords.

I'm leaning towards using preg_match with the pattern '/(\".*\")/' to get the phrases into an array, then removing the phrases from the string, and finally working the keywords into the array. I just can't pull everything together!

I'm also thinking of replacing the spaces outside quotes with commas and then exploding the result into an array. If that's a better option, how do I do that with preg_replace?

Is there a better way to go about this? Help! Thanks much, everyone.

Ana Ban asked Oct 30 '11 05:10


3 Answers

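$subject = 'value of "measured response" detect goal "method valuation" study';
preg_match_all('/(?<!")\b\w+\b|(?<=")\b[^"]+/', $subject, $result, PREG_PATTERN_ORDER);
// $result[0] holds every full match: bare keywords and the contents of quoted phrases.
foreach ($result[0] as $match) {
    echo $match, "\n";
}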

This should yield the results you are looking for.

Explanation:

# (?<!")\b\w+\b|(?<=")\b[^"]+
# 
# Match either the regular expression below (attempting the next alternative only if this one fails) «(?<!")\b\w+\b»
#    Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!")»
#       Match the character “"” literally «"»
#    Assert position at a word boundary «\b»
#    Match a single character that is a “word character” (letters, digits, etc.) «\w+»
#       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
#    Assert position at a word boundary «\b»
# Or match regular expression number 2 below (the entire match attempt fails if this one fails to match) «(?<=")\b[^"]+»
#    Assert that the regex below can be matched, with the match ending at this position (positive lookbehind) «(?<=")»
#       Match the character “"” literally «"»
#    Assert position at a word boundary «\b»
#    Match any character that is NOT a “"” «[^"]+»
#       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
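
As a quick check, the pattern explained above can be run against the four input shapes listed in the question. This is only a minimal sketch: the helper name tokenize_search is made up for illustration and is not part of the answer, and the sample strings are shortened versions of the question's example.

```php
<?php
// Hypothetical helper (the name is an assumption) wrapping the pattern above.
function tokenize_search(string $subject): array
{
    preg_match_all('/(?<!")\b\w+\b|(?<=")\b[^"]+/', $subject, $result);
    return $result[0]; // full matches only
}

$cases = [
    'value of study',                         // no phrases in quotes
    'value "measured response" study',        // phrases mixed with keywords
    '"measured response" "method valuation"', // only phrases in quotes
    'value',                                  // a single keyword
];

foreach ($cases as $case) {
    echo implode(',', tokenize_search($case)), "\n";
}
```

Because the first alternative relies on \w+, a keyword such as well-known would come back as two tokens (well and known), so this pattern is best suited to plain alphanumeric keywords.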
FailedDev answered Sep 19 '22 14:09


There is no need for a regular expression. The built-in function str_getcsv can split a string using any given delimiter, enclosure, and escape characters.

It really is as simple as:

// where $string is the string to parse
$array = str_getcsv($string, ' ', '"'); 
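
One case worth guarding against, as an assumption about real-world input rather than something the question states, is repeated spaces between terms: each extra space produces an empty field. A possible sketch that filters those out (the arrow function needs PHP 7.4 or later, and the sample string is invented for illustration):

```php
<?php
$string = 'value  of "measured response"   study'; // note the repeated spaces

// The escape character is passed explicitly so the call does not rely on its default.
$fields = str_getcsv($string, ' ', '"', '\\');

// Drop the empty fields produced by consecutive spaces, then reindex from 0.
$tokens = array_values(array_filter(
    $fields,
    fn($field) => $field !== null && $field !== ''
));

print_r($tokens);
```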
Fraser answered Sep 18 '22 14:09


$s = 'value of "measured response" detect goal "method valuation" study';
preg_match_all('~(?|"([^"]+)"|(\S+))~', $s, $matches);
print_r($matches[1]);

output:

Array
(
    [0] => value
    [1] => of
    [2] => measured response
    [3] => detect
    [4] => goal
    [5] => method valuation
    [6] => study
)

The trick here is to use a branch-reset group: (?|...|...). It's just like an alternation contained in a non-capturing group - (?:...|...) - except that within each branch the capturing-group numbers start at the same number. (For more info, see the PCRE docs and search for DUPLICATE SUBPATTERN NUMBERS.)

Thus, the text we're interested in is always captured in group #1. You can retrieve the contents of group #1 for all matches via $matches[1]. (That's assuming the PREG_PATTERN_ORDER flag is set; I didn't specify it like @FailedDev did because it's the default. See the PHP docs for details.)
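
To make the difference concrete, here is a small side-by-side sketch of the same alternation written with an ordinary non-capturing group and with a branch-reset group. The shortened input string and the variable names $plain and $reset are chosen only for this illustration.

```php
<?php
$s = 'of "measured response"';

// Ordinary non-capturing group: each branch gets its own group number,
// so phrases land in group 1 and keywords in group 2.
preg_match_all('~(?:"([^"]+)"|(\S+))~', $s, $plain);

// Branch-reset group: both branches capture into group 1.
preg_match_all('~(?|"([^"]+)"|(\S+))~', $s, $reset);

print_r($plain[1]); // group 1: an empty string for the keyword match, then the phrase
print_r($plain[2]); // group 2: the keyword comes first
print_r($reset[1]); // keyword and phrase together, in match order
```

With the non-capturing version the results would have to be merged back together from two groups, which is the extra work the branch-reset group avoids.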

Alan Moore answered Sep 17 '22 14:09