Optimized regex for N words around a given word (UTF-8)

I'm trying to find an optimized regex that returns the N words (if available) on each side of a given word, in order to build a summary. The string is in UTF-8, so the definition of a "word" is broader than just [a-z]. The reference string may also sit in the middle of a word, or not be directly surrounded by spaces.

I already have the following, which works but seems too greedy and chokes when looking for more than 6-7 words around the target:

/(?:[^\s\r\n]+[\s\r\n]+[^\s\r\n]*){0,4}lorem(?:[^\s\r\n]*[\s\r\n]+[^\s\r\n]+){0,4}/u

This is the PHP method I've built to do that, but I need help making the regex less greedy so that it works for any number of surrounding words.

/**
 * Finds N words around a specified word in a string.
 *
 * @param string $string The complete string to look in.
 * @param string $find The string to look for.
 * @param integer $before The number of words to look for before $find.
 * @param integer $after The number of words to look for after $find.
 * @return mixed False if $find was not found and all the words around otherwise.
 */
private function getWordsAround($string, $find, $before, $after)
{
    $matches = array();
    // Pass the delimiter so a "/" inside $find is escaped as well.
    $find = preg_quote($find, '/');
    $regex = '(?:[^\s\r\n]+[\s\r\n]+[^\s\r\n]*){0,' . (int)$before . '}' .
        $find . '(?:[^\s\r\n]*[\s\r\n]+[^\s\r\n]+){0,' . (int)$after . '}';
    if (preg_match("/$regex/u", $string, $matches)) {
        return $matches[0];
    } else {
        return false;
    }
}

If I had the following $string:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, 
felis non vehicula suscipit, enim quam adipiscing turpis, eget rutrum 
eros velit non enim. Sed commodo cursus vulputate. Aliquam id diam sed arcu 
fringilla venenatis. Cras vitae ante ut tellus malesuada convallis. Vivamus 
luctus ante vel ligula eleifend condimentum. Donec a vulputate velit. 
Suspendisse velit risus, volutpat at dapibus vitae, viverra vel nulla."

And called getWordsAround($string, 'vitae', 8, 8) I'd want to get the following result:

"Aliquam id diam sed arcu 
fringilla venenatis. Cras vitae ante ut tellus malesuada convallis. Vivamus 
luctus ante"

Thank you for your help, regex gurus.

lpfavreau asked Aug 27 '10 15:08

1 Answer

What about using a regex or some other method to split the input text into an array of words? Then loop through the words looking for the target word. Once it's found, grab the required array slice, join it back together and print it.

To maintain the original whitespace between words, you can include it at the end of each word.
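A minimal sketch of that split-and-slice idea, assuming words are separated by whitespace runs and that $find never spans a separator. The function name wordsAround is hypothetical and is not the asker's method. It keeps each separator as its own token instead of gluing it onto the word, which preserves the original spacing in the same way.

```php
<?php
// Hypothetical helper (an illustration, not the asker's method):
// split into tokens, locate the target word, slice, and join.
function wordsAround(string $string, string $find, int $before, int $after)
{
    if ($find === '') {
        return false;
    }
    // Capturing the separator keeps every whitespace run as its own token,
    // so words sit at even indexes and separators at odd ones.
    $tokens = preg_split('/(\s+)/u', trim($string), -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($tokens === false) {
        return false; // preg_split failed, e.g. on invalid UTF-8 input
    }
    for ($i = 0, $n = count($tokens); $i < $n; $i += 2) {
        if (strpos($tokens[$i], $find) !== false) {
            // Each word to the left or right costs two tokens (word + separator).
            $start = max(0, $i - 2 * $before);
            $end = min($n - 1, $i + 2 * $after);
            return implode('', array_slice($tokens, $start, $end - $start + 1));
        }
    }
    return false;
}
```

Calling wordsAround($string, 'vitae', 8, 8) would then return the first word containing "vitae" plus up to eight words on each side. The cost grows linearly with the input length, since there is no nested quantifier to backtrack through.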

Also, this could be implemented as a stream parser rather than splitting the whole string first.
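As a rough illustration of avoiding the up-front split, the sketch below locates the first match with strpos() and then scans outward from that offset. This is not a true stream parser, just an offset-based scan. The name wordsAroundByScan is hypothetical, and only ASCII whitespace is treated as a separator, which is an assumption. Byte-wise scanning is safe on UTF-8 input because multi-byte sequences never contain ASCII whitespace bytes.

```php
<?php
// Hypothetical helper: expand outward from the first occurrence of $find
// instead of tokenizing the whole input.
function wordsAroundByScan(string $string, string $find, int $before, int $after)
{
    $ws = " \t\r\n\f\v"; // assumption: only ASCII whitespace separates words
    if ($find === '') {
        return false;
    }
    $pos = strpos($string, $find);
    if ($pos === false) {
        return false;
    }

    // Walk left: first to the start of the word holding the match,
    // then across up to $before (separator + word) pairs.
    $start = $pos;
    for ($n = 0; $n <= $before; $n++) {
        while ($start > 0 && strpos($ws, $string[$start - 1]) === false) {
            $start--; // back up to the start of the current word
        }
        if ($n === $before) {
            break;
        }
        $probe = $start;
        while ($probe > 0 && strpos($ws, $string[$probe - 1]) !== false) {
            $probe--; // skip the whitespace run before the current word
        }
        if ($probe === 0) {
            break; // no further word to the left
        }
        $start = $probe;
    }

    // Walk right: finish the current word, then take up to $after more words.
    $end = $pos + strlen($find);
    $end += strcspn($string, $ws, $end);
    for ($n = 0; $n < $after; $n++) {
        $gap = strspn($string, $ws, $end);
        $word = strcspn($string, $ws, $end + $gap);
        if ($word === 0) {
            break; // only trailing whitespace, or end of input
        }
        $end += $gap + $word;
    }
    return substr($string, $start, $end - $start);
}
```

The work done is proportional to the size of the returned excerpt plus the cost of the initial strpos(), so large inputs with a small N stay cheap.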

a'r answered Sep 20 '22 02:09