 

File formats that can be read using Pig

What kinds of file formats can be read using Pig?

How can I store them in different formats? Say we have a CSV file and I want to store it as an MXL file; how can this be done? Whenever we use the STORE command, it creates a directory and stores the file as part-m-00000. How can I change the name of the file and overwrite the directory?

chhaya vishwakarma asked Jan 25 '12 11:01



1 Answer

What kinds of file formats can be read using Pig? How can I store them in different formats?

There are a few built-in loading and storing methods, but they are limited:

  • BinStorage - "binary" storage
  • PigStorage - loads and stores data that is delimited by something (such as tab or comma)
  • TextLoader - loads data line by line (i.e., delimited by the newline character)

Piggybank is a library of community-contributed user-defined functions. It has a number of loading and storing methods, including an XML loader, but not an XML storer.


Say we have a CSV file and I want to store it as an MXL file; how can this be done?

I assume you mean XML here. Storing XML is a bit rough in Hadoop because the output is written as a separate file per reducer, so how do you know where to put the root tag? This should probably be handled by some sort of post-processing step that produces well-formed XML.
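
Such a post-processing step could look something like the Python sketch below. It assumes the part files have already been copied to a local directory (for example with hadoop fs -get), that every non-empty line in them is one complete XML element, and that a root tag named items is acceptable. The function wrap_parts and all of its parameter names are hypothetical; nothing here is provided by Pig or Hadoop.

```python
import glob
import os


def wrap_parts(part_dir, out_path, root="items"):
    """Concatenate part-* files from part_dir into out_path inside one root tag.

    Assumes every non-empty line of every part file is a complete XML element.
    Returns the number of elements written.
    """
    count = 0
    with open(out_path, "w") as out:
        out.write("<%s>\n" % root)
        # Sort so the output order is stable (part-m-00000, part-m-00001, ...).
        for path in sorted(glob.glob(os.path.join(part_dir, "part-*"))):
            with open(path) as part:
                for line in part:
                    line = line.strip()
                    if line:
                        out.write("  %s\n" % line)
                        count += 1
        out.write("</%s>\n" % root)
    return count
```

Because the root tag is written once, around all part files, it does not matter how many part files the job produced.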

One thing you can do is to write a UDF that converts your columns into an XML string:

B = FOREACH A GENERATE customudfs.DataToXML(col1, col2, col3);

For example, say col1, col2, col3 are "foo", 37, "lemons", respectively. Your UDF can output the string "<item><name>foo</name><num>37</num><fruit>lemons</fruit></item>".
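
The conversion logic inside such a UDF might look like the Python sketch below. It is only the string-building part, not a complete Pig UDF: Pig can register Python functions through Jython, or the same logic could be ported to a Java EvalFunc, and that wiring is left out here. The names escape_xml and data_to_xml and the tag names item, name, num, and fruit are assumptions for illustration.

```python
def escape_xml(value):
    """Escape the characters that are special in XML text content."""
    text = str(value)
    # Replace & first so the entities added below are not escaped again.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def data_to_xml(name, num, fruit):
    """Build one <item> element from three column values."""
    return "<item><name>%s</name><num>%s</num><fruit>%s</fruit></item>" % (
        escape_xml(name), escape_xml(num), escape_xml(fruit))
```

Escaping matters because a raw & or < in a column value would otherwise make the resulting XML invalid.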


Whenever we use the STORE command, it creates a directory and stores the file as part-m-00000. How can I change the name of the file and overwrite the directory?

You can't change the name of the output file to something other than part-m-00000. That's just how Hadoop works. If you want a different name, rename the file after the fact with something like hadoop fs -mv output/part-m-00000 newoutput/myoutputfile. This could be done with a bash script that runs the Pig script and then executes this command.
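
The answer suggests a bash script; the same two steps are sketched here in Python instead, as a small wrapper around the pig and hadoop command-line tools. The script name myscript.pig is a made-up placeholder, and build_commands and run_and_rename are hypothetical helper names.

```python
import subprocess


def build_commands(pig_script, src, dst):
    """Return the two commands to run, in order, as argument lists."""
    return [
        ["pig", pig_script],
        ["hadoop", "fs", "-mv", src, dst],
    ]


def run_and_rename(pig_script, src, dst):
    """Run the Pig script, then rename its output file.

    check_call raises CalledProcessError on a non-zero exit code, so the
    rename is skipped if the Pig job fails.
    """
    for cmd in build_commands(pig_script, src, dst):
        subprocess.check_call(cmd)
```

Calling run_and_rename("myscript.pig", "output/part-m-00000", "newoutput/myoutputfile") would run the job and then move the part file to its new name, provided both tools are on the PATH.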

Donald Miner answered Oct 31 '22 09:10
