I have a text file whose first row contains the header. I want to run some operations on the data, but when I load the file with PigStorage the header row is loaded too. I just want to skip the header. Is it possible to do so (directly or through a UDF)?
This is the command I'm using to load the data:
input_file = load '/home/hadoop/smdb_tracedata.csv'
USING PigStorage(',')
as (trans:chararray, carrier:chararray,aainday:chararray);
Usually the way I solve this problem is to use a FILTER on a value I know appears only in the header. For example, consider the following data:
STATE,NAME
MD,Bob
VA,Larry
I'll do:
B = FILTER A BY state != 'STATE';
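For completeness, here is a minimal sketch of that filter as a whole script, run against the STATE,NAME sample above. The file name 'people.csv' and the field names are assumptions made for illustration; they are not from the original question. It also assumes that no real data row has the literal value 'STATE' in its first field, since such a row would be dropped along with the header.

```pig
-- Sketch only: 'people.csv' is a hypothetical file holding the STATE,NAME sample.
A = LOAD 'people.csv' USING PigStorage(',') AS (state:chararray, name:chararray);

-- The header row is loaded as ordinary data, so its first field is the text 'STATE'.
-- Keep only the rows whose first field differs from that label.
B = FILTER A BY state != 'STATE';

DUMP B;
```

With the sample data, only the MD and VA rows should remain in B. Note that a comparison against a null field evaluates to null in Pig, so rows with an empty first field would also be filtered out by this condition.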
If you have Pig 0.11 or later, you could try this:
input_file = load '/home/hadoop/smdb_tracedata.csv' USING PigStorage(',') as (trans:chararray, carrier:chararray, aainday:chararray);
-- rank with no BY clause prepends a row-number field named rank_input_file
ranked = rank input_file;
NoHeader = Filter ranked by (rank_input_file > 1);
Ordered = Order NoHeader by rank_input_file;
New_input_file = foreach Ordered Generate trans, carrier, aainday;
This gets rid of the first row, leaving New_input_file the same as the original input but without the header row (assuming the header is the first row in the file). Please note that the RANK operator was introduced in Pig 0.11, so if you have an earlier version you will need to find another way.
Edit: added the Ordered line to make sure New_input_file keeps the same row order as the original input file.