Since there is no else or default clause in Pig's SPLIT operation, what would be the most elegant way to do the following? I'm not a big fan of having to copy and paste code.
SPLIT rawish_data
INTO good_rawish_data IF (
(uid > 0L) AND
(value1 > 0) AND
(value1 < 100) AND
(value1 IS NOT NULL) AND
(value2 > 0L) AND
(value2 < 200L) AND
(value3 >= 0) AND
(value3 <= 300)),
bad_rawish_data IF (NOT (
(uid > 0L) AND
(value1 > 0) AND
(value1 < 100) AND
(value1 IS NOT NULL) AND
(value2 > 0L) AND
(value2 < 200L) AND
(value3 >= 0) AND
(value3 <= 300)));
I would like to do something like:
SPLIT data
INTO good_data IF (
(value > 0)),
good_data_big_values IF (
(value > 100)),
bad_data DEFAULT;
Is anything like this possible in any way?
It is. Checking the docs for SPLIT, you want to use OTHERWISE. For example:
SPLIT data
INTO good_data IF (
(value > 0)),
good_data_big_values IF (
(value > 100)),
bad_data OTHERWISE;
So you almost got it. :)
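Applied to the original rawish_data example, OTHERWISE lets the condition be written only once. A minimal sketch, reusing the relation and field names from the question and assuming a Pig version that supports OTHERWISE:

```pig
-- The validity condition appears once; everything else falls through to OTHERWISE.
SPLIT rawish_data
INTO good_rawish_data IF (
    (uid > 0L) AND
    (value1 > 0) AND
    (value1 < 100) AND
    (value1 IS NOT NULL) AND
    (value2 > 0L) AND
    (value2 < 200L) AND
    (value3 >= 0) AND
    (value3 <= 300)),
bad_rawish_data OTHERWISE;
```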
NOTE: SPLIT can put a single row into both good_data and good_data_big_values if, for example, value is 150. I don't know if this is what you want, but you should be aware of it regardless. This also means that bad_data will only contain rows where value is 0 or less.
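If each row should land in exactly one relation instead, the conditions can be made mutually exclusive. A sketch, assuming that values above 100 belong only in good_data_big_values (the upper bound on good_data is my assumption, not part of the original example):

```pig
-- Each branch covers a disjoint range of value, so no row is emitted twice.
SPLIT data
INTO good_data IF ((value > 0) AND (value <= 100)),
good_data_big_values IF (value > 100),
bad_data OTHERWISE;
```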
You could write an IsGood() UDF in which all the conditions are checked. Then your Pig script is simply:
SPLIT data
INTO good_data IF (IsGood(data)),
good_data_big_values IF (IsGood(data) AND (value > 100)),
bad_data IF (NOT IsGood(data));
Another option might be to use a macro.
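A macro could wrap the SPLIT so that the condition lives in one place and can be reused across scripts. The following is an untested sketch: the macro name split_good_bad, its parameter input_rel, and its return aliases good and bad are all hypothetical, and it assumes a Pig version that supports both macros with multiple return aliases and OTHERWISE.

```pig
-- Hypothetical macro: the validity condition is defined once here.
DEFINE split_good_bad(input_rel) RETURNS good, bad {
    SPLIT $input_rel
    INTO $good IF (
        (uid > 0L) AND
        (value1 > 0) AND
        (value1 < 100) AND
        (value2 > 0L) AND
        (value2 < 200L) AND
        (value3 >= 0) AND
        (value3 <= 300)),
    $bad OTHERWISE;
};

-- Usage: both output relations come from a single macro call.
good_rawish_data, bad_rawish_data = split_good_bad(rawish_data);
```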