
Apache Pig - MATCHES with multiple match criteria

I am trying to take a logical match criterion like:

(("Foo" OR "Foo Bar" OR FooBar) AND ("test" OR "testA" OR "TestB")) OR TestZ

and apply it as a filter against a file in Pig using

result = filter inputfields by text matches '(some regex expression here)';

The problem is that I have no idea how to turn the logical expression above into a regular expression for the matches operator.

I have fiddled around with various things, and the closest I have come is something like this:

((?=.*?\bFoo\b | \bFoo Bar\b))(?=.*?\bTestZ\b)

Any ideas? I also need to do this conversion programmatically if possible.

Some examples:

a - The quick brown Foo jumped over the lazy test (This should pass as it contains Foo and test)

b - the was something going on in TestZ (This should also pass as it contains TestZ)

c - the quick brown Foo jumped over the lazy dog (This should fail as it contains Foo but not test, testA or TestB)

Thanks

user2495234 asked Sep 01 '13 11:09


2 Answers

Since you're using Pig, you don't actually need an involved regular expression. You can combine the boolean operators supplied by Pig with a couple of simple regular expressions, for example:

T = load 'matches.txt' as (str:chararray);
F = filter T by ((str matches '.*(Foo|Foo Bar|FooBar).*' and str matches '.*(test|testA|TestB).*') or str matches '.*TestZ.*');
dump F;
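
Pig's matches operator uses Java regular expressions and has to match the whole string, which is why each pattern above is wrapped in .* on both sides. Under that assumption, the same logic can be checked outside Pig with plain Java. This is only a minimal sketch: the class name FilterSketch and the method passes are hypothetical names chosen for illustration, and the sample strings are the ones from the question.

```java
public class FilterSketch {
    // Mirrors the Pig filter: (Foo group AND test group) OR TestZ.
    static boolean passes(String str) {
        boolean hasFoo = str.matches(".*(Foo|Foo Bar|FooBar).*");
        boolean hasTest = str.matches(".*(test|testA|TestB).*");
        boolean hasTestZ = str.matches(".*TestZ.*");
        return (hasFoo && hasTest) || hasTestZ;
    }

    public static void main(String[] args) {
        String[] data = {
            "The quick brown Foo jumped over the lazy test",
            "the was something going on in TestZ",
            "the quick brown Foo jumped over the lazy dog"
        };
        for (String s : data) {
            System.out.println(passes(s) + " : " + s);
        }
    }
}
```

Note that these patterns have no word boundaries, so a word such as Food or testing would also count as a hit. Adding \b around each group would tighten that if whole-word matching is required.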
jkovacs answered Oct 31 '22 13:10


You can use this regex with the matches method (the backslashes are doubled, as they would be inside a Java string literal):

^((?=.*\\bTestZ\\b)|(?=.*\\b(FooBar|Foo Bar|Foo)\\b)(?=.*\\b(testA|TestB|test)\\b)).*
  • Note that "Foo" OR "Foo Bar" OR "FooBar" is written as FooBar|Foo Bar|Foo rather than Foo|Foo Bar|FooBar, so that the longer alternatives are tried first and the regex does not match only Foo in a string containing FooBar or Foo Bar.
  • Also, since a look-ahead is zero-width, you need .* at the end of the regex so that matches can match the entire string.

Demo

public class Demo {
    public static void main(String[] args) {
        String[] data = { "The quick brown Foo jumped over the lazy test",
                "the was something going on in TestZ",
                "the quick brown Foo jumped over the lazy dog" };
        // TestZ OR ((FooBar|Foo Bar|Foo) AND (testA|TestB|test)), each as a zero-width look-ahead
        String regex = "^((?=.*\\bTestZ\\b)|(?=.*\\b(FooBar|Foo Bar|Foo)\\b)(?=.*\\b(testA|TestB|test)\\b)).*";
        for (String s : data) {
            // String.matches must match the entire string, hence the trailing .*
            System.out.println(s.matches(regex) + " : " + s);
        }
    }
}

output:

true : The quick brown Foo jumped over the lazy test
true : the was something going on in TestZ
false : the quick brown Foo jumped over the lazy dog
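
The question also asks how to do this conversion programmatically. One possible approach, sketched here under stated assumptions, is to build the regex from small helpers: each term becomes a look-ahead, AND concatenates look-aheads, and OR joins them with |. The class RegexBuilderSketch and the helpers term, and, or, toRegex and exampleRegex are hypothetical names introduced for illustration. The sketch assumes every phrase starts and ends with a word character (so \b behaves as expected) and that the text is a single line (since . does not match line terminators by default).

```java
import java.util.regex.Pattern;

public class RegexBuilderSketch {
    // A phrase that must occur somewhere in the text, as a whole word or phrase.
    static String term(String phrase) {
        return "(?=.*\\b" + Pattern.quote(phrase) + "\\b)";
    }

    // AND: zero-width conditions tested at the same position are simply concatenated.
    static String and(String... parts) {
        return "(?:" + String.join("", parts) + ")";
    }

    // OR: at least one of the zero-width conditions must hold.
    static String or(String... parts) {
        return "(?:" + String.join("|", parts) + ")";
    }

    // Anchor the condition at the start and consume the rest so matches() can succeed.
    static String toRegex(String condition) {
        return "^" + condition + ".*";
    }

    // (("Foo" OR "Foo Bar" OR "FooBar") AND ("test" OR "testA" OR "TestB")) OR "TestZ"
    static String exampleRegex() {
        return toRegex(
            or(
                and(
                    or(term("Foo"), term("Foo Bar"), term("FooBar")),
                    or(term("test"), term("testA"), term("TestB"))),
                term("TestZ")));
    }

    public static void main(String[] args) {
        String regex = exampleRegex();
        System.out.println(regex);
        System.out.println("The quick brown Foo jumped over the lazy test".matches(regex));
        System.out.println("the quick brown Foo jumped over the lazy dog".matches(regex));
    }
}
```

Because every term is its own look-ahead, the order of alternatives does not affect the result here. If the generated regex is pasted into a Pig script rather than used from Java, the backslashes would likely need to be doubled inside the quoted string.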
Pshemo answered Oct 31 '22 12:10