 

Pig Latin: Load multiple files from a date range (part of the directory structure)

I have the following scenario-

Pig version used: 0.70

Sample HDFS directory structure:

/user/training/test/20100810/<data files>
/user/training/test/20100811/<data files>
/user/training/test/20100812/<data files>
/user/training/test/20100813/<data files>
/user/training/test/20100814/<data files>

As you can see in the paths listed above, one of the directory names is a date stamp.

Problem: I want to load files from a date range, say from 20100810 to 20100813.

I can pass the 'from' and 'to' of the date range as parameters to the Pig script, but how do I make use of these parameters in the LOAD statement? I am able to do the following:

temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader() AS (...); 

The following works with hadoop:

hadoop fs -ls /user/training/test/{20100810..20100813} 

But it fails when I try the same with LOAD inside the pig script. How do I make use of the parameters passed to the Pig script to load data from a date range?

Error log follows:

Backend error message during job submission
-------------------------------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:858)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:875)
        at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:752)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:752)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:726)
        at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
        at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://<ServerName>.com/user/training/test/{20100810..20100813} matches 0 files
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:258)
        ... 14 more

Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias test
        at org.apache.pig.PigServer.openIterator(PigServer.java:521)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
        at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:169)

Do I need to use a higher-level language like Python to generate all the date stamps in the range and pass them to LOAD as a comma-separated list?

cheers

Arnkrishn asked Aug 18 '10




2 Answers

As zjffdu said, the path expansion is done by the shell. One common way to solve your problem is simply to use Pig parameters (which is a good way to make your script more reusable anyway):

shell:

pig -f script.pig -param input=/user/training/test/{20100810..20100812} 

script.pig:

temp = LOAD '$input' USING SomeLoader() AS (...); 
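Since the expansion happens before Pig ever runs, the range can also be turned into the comma form (which Hadoop's glob syntax does accept) on the shell side. A minimal sketch, assuming GNU seq and a range that stays within one month (the date stamps are then plain consecutive integers):

```shell
# seq generates the contiguous day stamps and -s joins them with commas,
# producing the {d1,d2,...,dN} alternation form Hadoop globs understand.
glob="{$(seq -s, 20100810 20100812)}"
echo "$glob"    # {20100810,20100811,20100812}
# The result can then be passed as the parameter, quoted so the shell
# leaves the braces alone:
# pig -f script.pig -param input="/user/training/test/$glob"
```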
Romain answered Sep 24 '22


Pig is processing your file name pattern using the Hadoop file glob utilities, not the shell's glob utilities. Hadoop's are documented here. As you can see, Hadoop does not support the '..' operator for a range. It seems to me you have two options: either write out the {date1,date2,...,dateN} list by hand, which is probably the way to go if this is a rare use case, or write a wrapper script which generates that list for you. Building such a list from a date range should be a trivial task for the scripting language of your choice. For my application, I've gone with the generated-list route, and it's working fine (CDH3 distribution).
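The generated-list wrapper mentioned above can be sketched in a few lines of shell. This version assumes GNU date (for the `-d '... + 1 day'` relative-date arithmetic) and that `from` is not after `to`; unlike a plain numeric seq, it handles month and year boundaries correctly:

```shell
# Hypothetical wrapper: expand a from/to date range into the
# {d1,d2,...,dN} comma form that Hadoop's glob syntax accepts.
from=20100810
to=20100813

d="$from"
list="$d"
while [ "$d" != "$to" ]; do
  d=$(date -d "$d + 1 day" +%Y%m%d)   # GNU date: next day, as YYYYMMDD
  list="$list,$d"
done

echo "{$list}"   # {20100810,20100811,20100812,20100813}
# pig -f script.pig -param input="/user/training/test/{$list}"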

Mark Tozzi answered Sep 21 '22