Apache Pig can load data from Hadoop sequence files using the PiggyBank <code>SequenceFileLoader</code>:

<code>REGISTER /home/hadoop/pig/contrib/piggybank/java/piggybank.jar;</code>
<code>DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();</code>
<code>log = LOAD '/data/logs' USING SequenceFileLoader AS (...)</code>

Is there also a library out there that would allow writing to Hadoop sequence files from Pig?

Storing data to SequenceFile from Apache Pig

1 Answer

It's just a matter of implementing a StoreFunc to do so.

This is possible now, although it will become a fair bit easier once Pig 0.7 comes out, as it includes a complete redesign of the Load/Store interfaces.

The "Hadoop expansion pack" that Twitter open-sourced on GitHub includes code for generating Load and Store funcs based on Google Protocol Buffers (building on Input/Output formats for the same; you already have those for sequence files, obviously). Check it out if you need examples of how to do some of the less trivial stuff. It should be fairly straightforward, though.
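Once such a StoreFunc exists, using it from a script should mirror the load side in the question. A rough sketch, assuming you have packaged your own implementation: the jar path, the class name `com.example.pig.SequenceFileStorer`, and the output path are all hypothetical placeholders, not part of PiggyBank or any library named here.

```pig
-- Hypothetical jar containing your own StoreFunc implementation
REGISTER /path/to/my-storefuncs.jar;

-- Hypothetical class name; substitute whatever your StoreFunc is called
DEFINE SequenceFileStorer com.example.pig.SequenceFileStorer();

-- 'log' is the relation loaded in the question; the output path is a placeholder
STORE log INTO '/data/logs-out' USING SequenceFileStorer;
```

How tuple fields map onto the sequence file's key and value types is decided inside the StoreFunc itself, so the script stays the same whichever Writable classes you pick.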

answered Oct 13 '22 19:10

SquareCog

Tags:

hadoop

apache-pig

asquithea
