Hadoop Pig - Removing csv header

Question

My CSV files have a header on the first line. Loading them into Pig creates a mess in any subsequent function (like SUM). For now, I first apply a filter to the loaded data to remove the rows containing the headers:

affaires    = load 'affaires.csv'   using PigStorage(',') as (NU_AFFA:chararray,    date:chararray) ;
affaires    = filter affaires by date matches '../../..';

This method seems clumsy to me, and I am wondering whether there is a way to tell Pig not to load the first line of the CSV, like an "as_header" boolean parameter to the load function. I don't see it in the docs. What would be best practice? How do you usually deal with this?

Sivasakthi Jayaraman · Accepted Answer

The CSVExcelStorage loader supports skipping the header row, so use CSVExcelStorage instead of PigStorage. Download piggybank.jar and try this option.

Example

input.csv

Name,Age,Location
a,10,chennai
b,20,banglore

Pig script (with the SKIP_INPUT_HEADER option):

REGISTER '/tmp/piggybank.jar';
A  = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
DUMP A;

Output:

(a,10,chennai)
(b,20,banglore)
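
Since the question mentions SUM, here is a minimal sketch of the same load with a typed schema followed by an aggregate. The field names (name, age, location), their types, and the aliases grouped and total are assumptions made for illustration; the jar path is the one used in the script above.

```pig
-- Assumes piggybank.jar is at /tmp/piggybank.jar, as in the script above.
REGISTER '/tmp/piggybank.jar';

-- Skip the header row and give each column a name and a type
-- (the schema below is an assumption based on the sample input.csv).
A = LOAD 'input.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (name:chararray, age:int, location:chararray);

-- Put all rows into a single group so the whole age column can be summed.
grouped = GROUP A ALL;
total   = FOREACH grouped GENERATE SUM(A.age) AS total_age;
DUMP total;
```

With the header skipped, the age column contains only numeric values, so the aggregate is computed over the two data rows alone.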

Reference:
http://pig.apache.org/docs/r0.13.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
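
Applied to the file from the question, the load could look like the sketch below. It assumes piggybank.jar sits at /tmp/piggybank.jar and reuses the schema from the question; with the header skipped at load time, the date-pattern filter should no longer be needed.

```pig
-- Assumed jar location; adjust to wherever piggybank.jar is stored.
REGISTER '/tmp/piggybank.jar';

-- Same schema as in the question, but the header row is dropped by the loader.
affaires = LOAD 'affaires.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (NU_AFFA:chararray, date:chararray);
```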

Tags: csv, hadoop, apache-pig

Romain Jouin
