
Javascript regex pattern match multiple strings ( AND, OR ) against single string

I need to filter a collection of strings based on a rather complex query. In its "raw" form it looks like this:

nano* AND (regulat* OR *toxic* OR ((risk OR hazard) AND (exposure OR release)) )

An example of one of the strings to match against:

Workshop on the Second Regulatory Review on Nanomaterials, 30 January 2013, Brussels

So I need to match using AND, OR and wildcard characters, and I presume I'll need to use a regex in JavaScript.

I have it all looping correctly, filtering and generally working, but I'm 100% sure my regex is wrong - and some results are being omitted wrongly - here it is:

/(nano[a-zA-Z])?(regulat[a-zA-Z]|[a-zA-Z]toxic[a-zA-Z]|((risk|hazard)*(exposure|release)))/i

Any help would be greatly appreciated - I really can't abstract my mind correctly to understand this syntax!

UPDATE:

A few people have pointed out the importance of the order in which the regex is constructed. However, I have no control over the text strings that will be searched, so I need a solution that works regardless of the order in which the terms appear.

UPDATE:

I eventually used a PHP solution, due to the deprecation of Twitter API 1.0. See Pastebin for an example function (I know it's better to paste code here, but there's a lot):

function: http://pastebin.com/MpWSGtHK usage: http://pastebin.com/pP2AHEvk

Thanks for all help

Q Studio asked Feb 26 '13 13:02


2 Answers

A single regex is not the right tool for this, IMO:

/^(?=.*\bnano)(?=(?:.*\bregulat|.*toxic|(?=.*(?:\brisk\b|\bhazard\b))(?=.*(?:\bexposure\b|\brelease\b))))/i.test(subject)

would return true if the string fulfills the criteria you set forth, but I find nested lookaheads quite incomprehensible. If JavaScript supported commented regexes, it would look like this:

^                 # Anchor search to start of string
(?=.*\bnano)      # Assert that the string contains a word that starts with nano
(?=               # AND assert that the string contains...
 (?:              #  either
  .*\bregulat     #   a word starting with regulat
 |                #  OR
  .*toxic         #   any word containing toxic
 |                #  OR
  (?=             #   assert that the string contains
   .*             #    any string
   (?:            #    followed by
    \brisk\b      #    the word risk
   |              #    OR
    \bhazard\b    #    the word hazard
   )              #    (end of inner OR alternation)
  )               #   (end of first AND condition)
  (?=             #   AND assert that the string contains
   .*             #    any string
   (?:            #    followed by
    \bexposure\b  #    the word exposure
   |              #    OR
    \brelease\b   #    the word release
   )              #    (end of inner OR alternation)
  )               #   (end of second AND condition)
 )                #  (end of outer OR alternation)
)                 # (end of lookahead assertion)

Note that the entire regex is composed of lookahead assertions, so the match result itself will always be the empty string.
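
To illustrate both points (term order in the subject does not matter, and the matched text is empty), here is a small sketch. The variable names and the two extra test strings in the usage note are made up for illustration; the subject is the example string from the question.

```javascript
// The lookahead-only pattern from the answer above.
const query = /^(?=.*\bnano)(?=(?:.*\bregulat|.*toxic|(?=.*(?:\brisk\b|\bhazard\b))(?=.*(?:\bexposure\b|\brelease\b))))/i;

// Example string from the question: "Regulatory" comes before "Nanomaterials".
const subject = "Workshop on the Second Regulatory Review on Nanomaterials, 30 January 2013, Brussels";

// Every lookahead scans from position 0, so the order of the terms is irrelevant.
const matched = query.test(subject);

// exec() returns a match array; element 0 is the matched text, which is
// empty here because lookaheads consume no characters.
const result = query.exec(subject);
const matchedText = result === null ? null : result[0];

console.log(matched, JSON.stringify(matchedText)); // true ""
```

With the same pattern, "Nanotubes in consumer products" should fail (no second term), while "Hazard assessment: exposure to nanoparticles" should pass through the risk/hazard AND exposure/release branch.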

Instead, you could use several simple regexes and combine them with boolean operators:

if (/\bnano/i.test(str) &&
    (
        /\bregulat|toxic/i.test(str) ||
        (
            /\b(?:risk|hazard)\b/i.test(str) &&
            /\b(?:exposure|release)\b/i.test(str)
        )
    )
) {
    /* all tests pass */
}
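
Since the goal is to filter a collection of strings, the separate tests can be wrapped in a predicate and passed to Array.prototype.filter. This is only a sketch: the helper name matchesQuery is hypothetical, and the sample titles are made up apart from the first one, which is the example from the question.

```javascript
// Hypothetical helper; the boolean logic is the one from the answer above.
function matchesQuery(str) {
    return /\bnano/i.test(str) &&
        (
            /\bregulat|toxic/i.test(str) ||
            (
                /\b(?:risk|hazard)\b/i.test(str) &&
                /\b(?:exposure|release)\b/i.test(str)
            )
        );
}

// Sample data: only the first title comes from the question, the rest are invented.
const titles = [
    "Workshop on the Second Regulatory Review on Nanomaterials, 30 January 2013, Brussels",
    "Ecotoxicity of nanosilver in freshwater",
    "Nanotubes: a short history",
    "Hazard and exposure assessment for industrial solvents"
];

// Keeps the first two titles: the third has no second term, the fourth has no "nano".
const hits = titles.filter(matchesQuery);
console.log(hits);
```

Keeping each condition in its own small regex makes it easy to change one term without touching the rest of the query.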

Tim Pietzcker answered Sep 27 '22 18:09


Regular expressions have to move through the string in order. You have "nano" before "regulat" in the pattern, but they are swapped in the test string. Instead of using regexen to do this, I'd stick with plain old string parsing:

// indexOf is case-sensitive, so normalize the case first
var s = str.toLowerCase();
if (s.indexOf('nano') > -1) {
    if (s.indexOf('regulat') > -1 || s.indexOf('toxic') > -1
        || ((s.indexOf('risk') > -1 || s.indexOf('hazard') > -1)
        && (s.indexOf('exposure') > -1 || s.indexOf('release') > -1)
    )) {
        /* all tests pass */
    }
}

If you want to actually capture the words (e.g. get "Regulatory" from where "regulat" is), I would split the sentence by word breaks and inspect individual words.
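
A rough sketch of that word-splitting idea follows. The helper name findWords and its stems parameter are hypothetical, and splitting on \W+ assumes plain ASCII words.

```javascript
// Hypothetical helper: returns the words in `str` that contain any of the `stems`.
function findWords(str, stems) {
    // Split on runs of non-word characters and drop any empty pieces.
    // Note: \W treats only ASCII letters, digits and "_" as word characters.
    var words = str.split(/\W+/).filter(function (w) {
        return w.length > 0;
    });
    return words.filter(function (word) {
        // indexOf is case-sensitive, so compare in lower case.
        var lower = word.toLowerCase();
        return stems.some(function (stem) {
            return lower.indexOf(stem) > -1;
        });
    });
}

var sentence = "Workshop on the Second Regulatory Review on Nanomaterials, 30 January 2013, Brussels";
var found = findWords(sentence, ["nano", "regulat"]);
console.log(found.join(", ")); // Regulatory, Nanomaterials
```

The words come back in their original case and in the order they appear in the sentence.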

Explosion Pills answered Sep 27 '22 19:09