As I've noted previously, Pig doesn't cope well with empty (0-byte) files. Unfortunately, there are lots of ways these files can be created (even by Hadoop utilities themselves).
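For instance (using the same hypothetical bucket as the example below), a zero-length key can be created explicitly, and some S3 tools leave a similar 0-byte "directory" placeholder object behind when they create a prefix:

hadoop fs -touchz s3://mybucket/a/b/empty.log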
I thought that I could work around this problem by explicitly loading only files that match a given naming convention in the LOAD statement using Hadoop's glob syntax. Unfortunately, this doesn't seem to work, as even when I use a glob to filter down to known-good input files, I still run into the 0-byte failure mentioned earlier.
Here's an example. Assume I have the following files in S3 (note the 0-byte "directory" placeholder):

mybucket/a/b                 (zero-length placeholder object)
mybucket/a/b/myfile.log
mybucket/a/b/yourfile.log
If I use a LOAD statement like this in my Pig script:
myData = load 's3://mybucket/a/b/*.log' as ( ... );
I would expect that Pig would not choke on the 0-byte file, but it still does. Is there a trick to getting Pig to actually only look at files that match the expected glob pattern?
For reference, here's how a basic LOAD works. You can load the data from the file student_data.txt into Pig by executing the following Pig Latin statement in the Grunt shell:

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
          USING PigStorage(',')
          AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
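Each line of student_data.txt is then expected to be a comma-separated record matching that schema; a hypothetical row:

001,Rajiv,Reddy,9848022337,Hyderabad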
The PigStorage() function loads and stores data as structured text files. It takes as a parameter the delimiter that separates the fields of each tuple.
Pig itself is a tool/platform used to analyze large data sets by representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig. To write data analysis programs, Pig provides a high-level language known as Pig Latin.
PigStorage is also Pig's default load function: if you omit the USING clause, LOAD uses PigStorage with a tab delimiter.
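For example (hypothetical file path), these two statements are equivalent:

grunt> raw = LOAD '/pig_data/scores.tsv' AS (id:int, score:int);
grunt> raw = LOAD '/pig_data/scores.tsv' USING PigStorage('\t') AS (id:int, score:int);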
This is a fairly ugly solution, but globs that don't rely on the * wildcard syntax appear to work. So, in our workflow (before calling our Pig script), we list all of the files below the prefix we're interested in, and then create a specific glob that consists of only the paths we want.
For example, given the layout above, we list mybucket/a:
hadoop fs -lsr s3://mybucket/a
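The output looks roughly like this (permissions, sizes, and dates here are illustrative):

-rw-r--r--   1 me mygroup        0 2012-04-01 12:00 s3://mybucket/a/b
-rw-r--r--   1 me mygroup    52311 2012-04-01 12:00 s3://mybucket/a/b/myfile.log
-rw-r--r--   1 me mygroup    48122 2012-04-01 12:01 s3://mybucket/a/b/yourfile.log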
Each line gives a file's metadata followed by its path; note the 0-byte placeholder. We can then build the glob from the real files:
myData = load 's3://mybucket/a/b{/myfile.log,/yourfile.log}' as ( ... );
This requires a bit more front-end work, but it allows us to target exactly the files we're interested in and avoid 0-byte files.
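A minimal sketch of that front-end step (the script name myscript.pig, the parameter name INPUT, and the awk column positions are assumptions based on classic hadoop fs -lsr output, where the size is column 5 and the path is last):

#!/bin/sh
# Sketch: build a brace glob of only the non-empty .log files under the
# prefix, then pass it to the Pig script via parameter substitution.
PREFIX=s3://mybucket/a/b
GLOB=$(hadoop fs -lsr "$PREFIX" \
  | awk '$5 > 0 && $NF ~ /\.log$/ {print $NF}' \
  | sed "s#.*$PREFIX##" \
  | paste -sd, -)
# myscript.pig would load: myData = load '$INPUT' as ( ... );
pig -param INPUT="$PREFIX{$GLOB}" myscript.pig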
Update: Unfortunately, I've found that this solution fails when the glob pattern gets long; Pig ends up throwing an "Unable to create input slice" exception.