
Regexp matching in pig

Using Apache Pig and the text

hahahah.  my brother just didnt do anything wrong. He cheated on a test? no way!

I'm trying to match "my brother just didnt do anything wrong."

Ideally, I'd want to match anything beginning with "my brother just" and ending with either punctuation (end of sentence) or EOL.

Looking at the Pig docs, and then following the link to java.util.regex.Pattern, I figure I should be able to use

extrctd = FOREACH fltr GENERATE FLATTEN(EXTRACT(txt,'(my brother just .*\\p{Punct})')) as (txt:chararray);

But that seems to match until the end of the line. Any suggestions for performing this match? I'm ready to pull my hair out, and by pull my hair out, I mean switch to Python streaming.

Neil Kodner asked Jul 19 '10 21:07



1 Answer

By default quantifiers are greedy. This means they match as much as possible. In this case you want to match only up to the first punctuation mark. In other words you want to match as little as possible.

So to solve your problem you should make the quantifier non-greedy by adding a ? immediately after it:

my brother just .*?\\p{Punct}
                  ^

Note that the ? here modifies the preceding quantifier (making .* match as little as possible), which is different from its use as a quantifier in its own right, where it means 'match zero or one'.
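
Since Pig hands the pattern to java.util.regex, the difference can be checked in plain Java outside of Pig. The following is only a minimal sketch: the class name GreedyVsReluctant and the helper firstMatch are made up for illustration, and the third pattern, which also accepts end of line via (?:\\p{Punct}|$), is an assumption about how the "punctuation or EOL" requirement from the question could be covered, not something tested in Pig itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Hypothetical helper: returns capture group 1 of the first match,
    // or null if the pattern does not match anywhere in the text.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "hahahah.  my brother just didnt do anything wrong. "
                    + "He cheated on a test? no way!";

        // Greedy: .* runs to the end of the line, then backtracks to the LAST punctuation mark.
        System.out.println(firstMatch("(my brother just .*\\p{Punct})", text));

        // Reluctant: .*? stops at the FIRST punctuation mark.
        System.out.println(firstMatch("(my brother just .*?\\p{Punct})", text));

        // Assumed variant: also stop at end of input when there is no punctuation.
        System.out.println(firstMatch("(my brother just .*?(?:\\p{Punct}|$))",
                                      "my brother just left"));
    }
}
```

With the sample text from the question, the greedy pattern captures everything through "no way!", while the reluctant one stops at "wrong.". The same pattern string (including the doubled backslash) should carry over to the EXTRACT call in the question.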

Mark Byers answered Oct 27 '22 14:10