
How can I add a header row to files created from Pig (Hadoop)?

I'm writing a Pig Latin script similar to the following:

A = load 'data' using PigStorage('\t');
store A into 'my_data' using PigStorage();

This outputs:

(Bob, 10, 4.0)
(Jim, 11, 3.25)
(Paul, 9, 2.75)

I'd like to add a header row as the first line of each file stored in HDFS:

(Name, Age, GPA)
(Bob, 10, 4.0)
(Jim, 11, 3.25)
(Paul, 9, 2.75)

Any ideas?

Ryan Guest asked Jan 07 '13 21:01


2 Answers

You can use CSVExcelStorage as the storage function, which allows you to do precisely what you want:

STORE output INTO '/outputfolder/' USING org.apache.pig.piggybank.storage.CSVExcelStorage('\t', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');

Using the "WRITE_OUTPUT_HEADER" option writes the header to every output file, which satisfies your use case.
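
For context, here is a fuller sketch of how that STORE line might sit in a script. The jar path, the relation name A, the output path 'my_data', and the field names and types are assumptions for illustration, not part of the original question. I'm also assuming the header names are taken from the relation's schema, which is why the LOAD declares one.

```pig
-- Assumed path: point this at wherever piggybank.jar lives on your system.
REGISTER /path/to/piggybank.jar;

-- Declare field names so the relation has a schema
-- (assumed to be the source of the header row).
A = LOAD 'data' USING PigStorage('\t') AS (name:chararray, age:int, gpa:double);

-- Tab-delimited output, no multiline fields, UNIX line endings, header written.
STORE A INTO 'my_data' USING org.apache.pig.piggybank.storage.CSVExcelStorage('\t', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');
```

If your Pig install already has piggybank on the classpath, the REGISTER line may be unnecessary.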

FirstName LastName answered Sep 27 '22 21:09


This doesn't really make sense for Pig. Each line is a separate record of data, so unless there really is a person named Name, with an age of Age and a GPA of GPA, having such a line is wrong. Also, Pig makes no guarantees about the order in which records will be output (unless you use ORDER BY), so your header row might show up anywhere.

What you are asking for is a way to keep your schema around after Pig is done with its work, so that you don't have to remember what it is or look it up somewhere. Starting with Pig 0.10, this has been possible with PigStorage, which can store the schema of the relation as a JSON file, .pig_schema, in the same directory as the output. See this page for more detailed information about what that is and how to use it.
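
As a rough sketch of what that looks like, PigStorage accepts a '-schema' option as its second argument. The field names, types, and the 'my_data' path below are assumptions for illustration.

```pig
-- Script 1: store the data along with its schema.
A = LOAD 'data' USING PigStorage('\t') AS (name:chararray, age:int, gpa:double);

-- '-schema' asks PigStorage to write a hidden .pig_schema file next to the part files.
STORE A INTO 'my_data' USING PigStorage('\t', '-schema');

-- Script 2 (run later): no AS clause needed, since PigStorage should pick up
-- the stored schema from .pig_schema unless '-noschema' is passed.
B = LOAD 'my_data' USING PigStorage('\t');
DESCRIBE B;
```

As far as I know, the same option also writes a .pig_header file containing the field names, which you could prepend yourself when pulling the part files out of HDFS. That keeps the data files free of a fake record while still giving you the column names.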

reo katoa answered Sep 27 '22 22:09