Using Apache Pig and the text
hahahah. my brother just didnt do anything wrong. He cheated on a test? no way!
I'm trying to match "my brother just didnt do anything wrong."
Ideally, I'd want to match anything beginning with "my brother just" and ending with either punctuation (end of sentence) or EOL.
Looking at the Pig docs, and then following the link to java.util.regex.Pattern, I figure I should be able to use:
extrctd = FOREACH fltr GENERATE FLATTEN(EXTRACT(txt,'(my brother just .*\\p{Punct})')) as (txt:chararray);
But that seems to match until the end of the line. Any suggestions for performing this match? I'm ready to pull my hair out, and by pull my hair out, I mean switch to Python streaming.
By default quantifiers are greedy. This means they match as much as possible. In this case you want to match only up to the first punctuation mark. In other words you want to match as little as possible.
So to solve your problem, you should make the quantifier non-greedy by adding a ? immediately after it:
my brother just .*?\\p{Punct}
                  ^
Note that the use of ? here is different from its use as a quantifier, where it means 'match zero or one'.
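Since Pig hands the pattern to java.util.regex, the difference is easy to check in plain Java. The sketch below is only an illustration: the class name LazyMatchDemo and the helper firstGroup are made up for this example, and the third pattern, which adds an end-of-line fallback for the "punctuation or EOL" requirement, is my own suggestion that I have not run inside a Pig script.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyMatchDemo {
    // Hypothetical helper: returns capture group 1 of the first match, or null if nothing matches.
    public static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "hahahah. my brother just didnt do anything wrong. He cheated on a test? no way!";

        // Greedy: .* runs to the end of the line, then backs off to the LAST punctuation mark.
        System.out.println(firstGroup("(my brother just .*\\p{Punct})", line));

        // Reluctant: .*? grows one character at a time and stops at the FIRST punctuation mark.
        System.out.println(firstGroup("(my brother just .*?\\p{Punct})", line));

        // Assumed extension: fall back to end of input when there is no closing punctuation.
        System.out.println(firstGroup("(my brother just .*?(?:\\p{Punct}|$))", "my brother just left"));
    }
}
```

One caveat: \p{Punct} also covers apostrophes and commas, so text such as "didn't" would end the match early. If only sentence endings should count, a character class like [.!?] in place of \\p{Punct} may be a better fit.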