As I've noted previously, Pig doesn't cope well with empty (0-byte) files. Unfortunately, there are lots of ways these files can be created (even by Hadoop utilities themselves).
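For instance (using the same hypothetical bucket as the example below), a zero-length key can be created explicitly, and some S3 tools leave a similar 0-byte "directory" placeholder object behind when they create a prefix:

hadoop fs -touchz s3://mybucket/a/b/empty.log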
I thought that I could work around this problem by explicitly loading only files that match a given naming convention in the LOAD statement using Hadoop's glob syntax. Unfortunately, this doesn't seem to work, as even when I use a glob to filter down to known-good input files, I still run into the 0-byte failure mentioned earlier.
Here's an example. Assume I have the following files in S3 (note the 0-byte "directory" placeholder):

mybucket/a/b                 (zero-length placeholder object)
mybucket/a/b/myfile.log
mybucket/a/b/yourfile.log
If I use a LOAD statement like this in my Pig script:
myData = load 's3://mybucket/a/b/*.log' as ( ... );
I would expect that Pig would not choke on the 0-byte file, but it still does. Is there a trick to getting Pig to actually only look at files that match the expected glob pattern?
For reference, here's how a basic LOAD works. You can load the data from the file student_data.txt into Pig by executing the following Pig Latin statement in the Grunt shell:

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
          USING PigStorage(',')
          AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
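Each line of student_data.txt is then expected to be a comma-separated record matching that schema; a hypothetical row:

001,Rajiv,Reddy,9848022337,Hyderabad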
The PigStorage() function loads and stores data as structured text files. It takes as a parameter the delimiter that separates the fields of each tuple.
Pig itself is a tool/platform used to analyze large data sets by representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig. To write data analysis programs, Pig provides a high-level language known as Pig Latin.
PigStorage is also Pig's default load function: if you omit the USING clause, LOAD uses PigStorage with a tab delimiter.
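For example (hypothetical file path), these two statements are equivalent:

grunt> raw = LOAD '/pig_data/scores.tsv' AS (id:int, score:int);
grunt> raw = LOAD '/pig_data/scores.tsv' USING PigStorage('\t') AS (id:int, score:int);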
This is a fairly ugly solution, but globs that don't rely on the * wildcard syntax appear to work. So, in our workflow (before calling our Pig script), we list all of the files below the prefix we're interested in, and then create a specific glob that consists of only the paths we want.
For example, given the layout above, we list mybucket/a:
hadoop fs -lsr s3://mybucket/a
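The output looks roughly like this (permissions, sizes, and dates here are illustrative):

-rw-r--r--   1 me mygroup        0 2012-04-01 12:00 s3://mybucket/a/b
-rw-r--r--   1 me mygroup    52311 2012-04-01 12:00 s3://mybucket/a/b/myfile.log
-rw-r--r--   1 me mygroup    48122 2012-04-01 12:01 s3://mybucket/a/b/yourfile.log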
Each line gives a file's metadata followed by its path; note the 0-byte placeholder. We can then build the glob from the real files:
myData = load 's3://mybucket/a/b{/myfile.log,/yourfile.log}' as ( ... );
This requires a bit more front-end work, but it allows us to target exactly the files we're interested in and avoid 0-byte files.
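A minimal sketch of that front-end step (the script name myscript.pig, the parameter name INPUT, and the awk column positions are assumptions based on classic hadoop fs -lsr output, where the size is column 5 and the path is last):

#!/bin/sh
# Sketch: build a brace glob of only the non-empty .log files under the
# prefix, then pass it to the Pig script via parameter substitution.
PREFIX=s3://mybucket/a/b
GLOB=$(hadoop fs -lsr "$PREFIX" \
  | awk '$5 > 0 && $NF ~ /\.log$/ {print $NF}' \
  | sed "s#.*$PREFIX##" \
  | paste -sd, -)
# myscript.pig would load: myData = load '$INPUT' as ( ... );
pig -param INPUT="$PREFIX{$GLOB}" myscript.pig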
Update: Unfortunately, I've found that this solution fails when the glob pattern gets long; Pig ends up throwing an "Unable to create input slice" exception.