
Skipping the header while loading a text file using Pig Latin

I have a text file whose first row contains the header. I want to run some operations on the data, but when I load the file using PigStorage, the header row is loaded too. I just want to skip the header. Is it possible to do so (directly or through a UDF)?

This is the command I'm using to load the data:

input_file = load '/home/hadoop/smdb_tracedata.csv'
USING PigStorage(',')
as (trans:chararray, carrier:chararray,aainday:chararray);
Pawan Kumar asked Oct 01 '13 11:10


2 Answers

Usually the way I solve this problem is to use a FILTER on something I know is in the header. For example, consider the following data example:

STATE,NAME
MD,Bob
VA,Larry

Assuming the data is loaded into a relation A with fields (state, name), I'll do:

B = FILTER A BY state != 'STATE';
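Applied to the file from the question, the same idea might look like the sketch below. The question does not show the header row, so the literal 'trans' is only an assumed placeholder for whatever text actually appears in the first column of the header; the relation name no_header is likewise hypothetical.

```pig
-- Sketch only: assumes the header's first field is the literal text 'trans'.
-- That value is a placeholder; replace it with the real header label.
input_file = LOAD '/home/hadoop/smdb_tracedata.csv'
    USING PigStorage(',')
    AS (trans:chararray, carrier:chararray, aainday:chararray);

-- Keep every row whose first field differs from the header label.
no_header = FILTER input_file BY trans != 'trans';

DUMP no_header;
```

This stays safe only while no real data row carries the header label in that column. Rows whose first field is null should also be dropped, since a comparison against null does not evaluate to true in a FILTER.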
Donald Miner answered Sep 28 '22 10:09


If you have Pig version 0.11 or later, you could try this:

input_file = load '/home/hadoop/smdb_tracedata.csv' USING PigStorage(',') as (trans:chararray, carrier:chararray, aainday:chararray);

-- rank prepends a row-number field named rank_input_file
ranked = rank input_file;

-- drop the first row (the header)
NoHeader = Filter ranked by (rank_input_file > 1);

-- restore the original row order
Ordered = Order NoHeader by rank_input_file;

-- project away the rank field
New_input_file = foreach Ordered Generate trans, carrier, aainday;

This gets rid of the first row, leaving New_input_file the same as the original minus the header row (assuming the header is the first row in the file). Note that the rank operator was introduced in Pig 0.11, so if you have an earlier version you will need to find another way.

Edit: added the Ordered line to make sure New_input_file keeps the same order as the original input file.

Davis Broda answered Sep 28 '22 08:09