How can I add a header row to files created from Pig (Hadoop)?

Question

I'm writing a Pig Latin script similar to the following:

A = load 'data' using PigStorage('\t');
store A into 'my_data' using PigStorage();

This outputs

(Bob, 10, 4.0)
(Jim, 11, 3.25)
(Paul, 9, 2.75)

I'd like to add a header row as the first line of each file stored in HDFS:

(Name, Age, GPA)
(Bob, 10, 4.0)
(Jim, 11, 3.25)
(Paul, 9, 2.75)

Any ideas?

FirstName LastName · Accepted Answer

You can use CSVExcelStorage as the storage function, which allows you to do precisely what you want:

STORE output INTO '/outputfolder/' USING org.apache.pig.piggybank.storage.CSVExcelStorage(' ', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');

Using the "WRITE_OUTPUT_HEADER" option will write the header to every output file, which satisfies your use case.
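
A slightly fuller sketch of the same approach, under a few assumptions: the piggybank.jar path is hypothetical, the field names are taken from the header the question asks for, and ',' is used as the delimiter purely for illustration. The header text comes from the relation's schema, so the fields need names when the data is loaded.

```pig
-- Hypothetical path; point this at wherever piggybank.jar lives on your system
REGISTER /path/to/piggybank.jar;

-- Name the fields so there is a schema to write as the header (names assumed)
A = LOAD 'data' USING PigStorage('\t') AS (Name:chararray, Age:int, GPA:double);

-- Arguments: delimiter, multiline handling, end-of-line style, header handling
STORE A INTO '/outputfolder/'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');
```

Because every part file gets its own header, concatenating the part files afterwards would repeat the header line once per file.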

reo katoa · Answer

This doesn't really make sense for Pig. Each line is a separate record of data, so unless there really is a person named Name, with an age of Age and a GPA of GPA, having such a line is wrong. Also, Pig makes no guarantees about the order in which records will be output (unless you use ORDER BY), so your header row might show up anywhere.

What you are asking for is a way to keep your schema around after Pig is done with its work, so that you don't have to remember what it is or look it up somewhere. Starting with Pig 0.10, this has been possible with PigStorage by storing the schema of the relation as a JSON file .pig_schema, in the same directory as the output. See this page for more detailed information about what that is and how to use it.
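
As a minimal sketch of what that looks like in a script, the schema file is requested with PigStorage's '-schema' option. The field names and types below are assumptions based on the sample data in the question:

```pig
-- Field names and types are assumed from the question's sample rows
A = LOAD 'data' USING PigStorage('\t') AS (name:chararray, age:int, gpa:double);

-- '-schema' asks PigStorage to write .pig_schema alongside the part files
STORE A INTO 'my_data' USING PigStorage('\t', '-schema');

-- In a later script, loading with the same option picks the schema up again,
-- so no AS clause is needed
B = LOAD 'my_data' USING PigStorage('\t', '-schema');
DESCRIBE B;
```

The data files themselves stay header-free, so every line remains a real record, while the field names travel with the output directory.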

Tags:

hadoop

apache-pig

Ryan Guest
