I'm writing a Pig Latin script similar to the following:
A = load 'data' using PigStorage('\t');
store A into 'my_data' using PigStorage();
This outputs:
(Bob, 10, 4.0)
(Jim, 11, 3.25)
(Paul, 9, 2.75)
I'd like to add a header row as the first line of each file stored in HDFS:
(Name, Age, GPA)
(Bob, 10, 4.0)
(Jim, 11, 3.25)
(Paul, 9, 2.75)
Any ideas?
You can use CSVExcelStorage from piggybank as the storage function, which lets you do precisely what you want:
STORE A INTO '/outputfolder/' USING org.apache.pig.piggybank.storage.CSVExcelStorage('\t', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');
The 'WRITE_OUTPUT_HEADER' option writes the header to every output file, which satisfies your use case.
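For completeness, here is a minimal sketch of the whole script with this approach. The jar path is a placeholder you would need to adjust, and the field names and types in the AS clause are my guess based on the sample output in the question. As far as I know the header is built from the relation's schema, so the fields have to be named somewhere before the STORE.

```pig
-- Hypothetical path: point this at wherever piggybank.jar lives on your cluster.
REGISTER /path/to/piggybank.jar;

-- Field names and types are assumed from the sample output in the question.
A = LOAD 'data' USING PigStorage('\t') AS (Name:chararray, Age:int, GPA:double);

-- Arguments: delimiter, multiline handling, line-ending style, header handling.
STORE A INTO 'my_data' USING org.apache.pig.piggybank.storage.CSVExcelStorage('\t', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');
```

Each part file under my_data should then start with a Name/Age/GPA line, followed by the data rows.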
This doesn't really make sense for Pig. Each line is a separate record of data, so unless there really is a person named Name, with an age of Age and a GPA of GPA, having such a line is wrong. Also, Pig makes no guarantees about the order in which records will be output (unless you use ORDER BY), so your header row might show up anywhere.
What you are asking for is a way to keep your schema around after Pig is done with its work, so that you don't have to remember what it is or look it up somewhere. Starting with Pig 0.10, this has been possible with PigStorage, which can store the schema of the relation as a JSON file, .pig_schema, in the same directory as the output. See this page for more detailed information about what that is and how to use it.
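A minimal sketch of what that looks like, assuming the '-schema' option of PigStorage and the same guessed field names as in the question's sample output:

```pig
-- Field names and types are assumed from the sample output in the question.
A = LOAD 'data' USING PigStorage('\t') AS (Name:chararray, Age:int, GPA:double);

-- '-schema' asks PigStorage to write the schema alongside the data
-- as a hidden .pig_schema file in the output directory.
STORE A INTO 'my_data' USING PigStorage('\t', '-schema');
```

If I remember correctly, this also writes a .pig_header file containing just the field names, and a later script that loads 'my_data' with PigStorage should pick the schema up without an AS clause. The data files themselves stay header-free.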