Pig Latin: Load multiple files from a date range (part of the directory structure)

Tags:

hadoop

apache-pig

I have the following scenario:

Pig version used: 0.7.0

Sample HDFS directory structure:

/user/training/test/20100810/<data files>
/user/training/test/20100811/<data files>
/user/training/test/20100812/<data files>
/user/training/test/20100813/<data files>
/user/training/test/20100814/<data files>

As you can see in the paths listed above, one of the directory names is a date stamp.

Problem: I want to load files from a date range, say from 20100810 to 20100813.

I can pass the 'from' and 'to' of the date range as parameters to the Pig script, but how do I make use of these parameters in the LOAD statement? I am able to do the following:

temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader() AS (...);

The following works with hadoop:

hadoop fs -ls /user/training/test/{20100810..20100813}

But it fails when I try the same with LOAD inside the pig script. How do I make use of the parameters passed to the Pig script to load data from a date range?

Error log follows:

Backend error message during job submission
-------------------------------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:858)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:875)
        at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:752)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:752)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:726)
        at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
        at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://<ServerName>.com/user/training/test/{20100810..20100813} matches 0 files
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:258)
        ... 14 more

Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias test
        at org.apache.pig.PigServer.openIterator(PigServer.java:521)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
        at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:169)

Do I need to use a higher-level language like Python to capture all date stamps in the range and pass them to LOAD as a comma-separated list?

cheers

asked Aug 18 '10 18:08

2 Answers

As zjffdu said, the path expansion is done by the shell. One common way to solve your problem is simply to use Pig parameters (which is a good way to make your script more reusable anyway):

shell:

pig -f script.pig -param input=/user/training/test/{20100810..20100812}

script.pig:

temp = LOAD '$input' USING SomeLoader() AS (...);
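
One caveat when relying on the shell here: bash treats {20100810..20100812} as a plain numeric sequence and produces one word per value, so it is worth checking with echo exactly which arguments pig will receive. The Python sketch below only imitates that numeric expansion (the helper name numeric_brace_expand is hypothetical, not part of Pig or Hadoop) to show two consequences: the prefix is repeated for every value, and a range that crosses a month boundary contains many numbers that are not dates.

```python
def numeric_brace_expand(prefix, start, end):
    """Imitate bash's {start..end} numeric sequence: one word per integer."""
    return [f"{prefix}{n}" for n in range(int(start), int(end) + 1)]


# Within a single month the sequence happens to coincide with the dates.
words = numeric_brace_expand("input=/user/training/test/", "20100810", "20100812")
print(words)

# Across a month boundary the sequence also includes 20100832 through 20100900.
crossing = numeric_brace_expand("", "20100830", "20100902")
print(len(crossing))  # 73 values, of which only 4 are calendar dates
```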

answered Sep 24 '22 02:09

Pig processes your file name pattern using Hadoop's file glob utilities, not the shell's. Hadoop's glob syntax is documented in the Hadoop API documentation (see FileSystem.globStatus). As you can see there, Hadoop does not support the '..' operator for a range. It seems to me you have two options: either write out the {date1,date2,date3,...,dateN} list by hand, which is probably the way to go if this is a rare use case, or write a wrapper script that generates the list for you. Building such a list from a date range should be a trivial task in the scripting language of your choice. For my application, I've gone the generated-list route, and it's working fine (CDH3 distribution).
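
A minimal sketch of such a wrapper, assuming Python and the YYYYMMDD directory names from the question. The function name date_glob and its parameters are hypothetical. It walks the calendar days between the two stamps and joins them into the comma-separated brace list that Hadoop's glob does accept.

```python
from datetime import datetime, timedelta


def date_glob(base_dir, start, end, fmt="%Y%m%d"):
    """Build a Hadoop-style glob covering every day from start to end, inclusive."""
    first = datetime.strptime(start, fmt).date()
    last = datetime.strptime(end, fmt).date()
    if last < first:
        raise ValueError("end date precedes start date")
    days = [
        (first + timedelta(days=offset)).strftime(fmt)
        for offset in range((last - first).days + 1)
    ]
    if len(days) == 1:
        # A single day needs no brace list at all.
        return base_dir + "/" + days[0]
    return base_dir + "/{" + ",".join(days) + "}"


print(date_glob("/user/training/test", "20100810", "20100813"))
# /user/training/test/{20100810,20100811,20100812,20100813}
```

The printed string can then be handed to Pig as one quoted parameter, for example pig -f script.pig -param input='/user/training/test/{20100810,20100811,20100812,20100813}', with the script loading '$input'. The single quotes keep the shell from expanding the braces itself.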

answered Sep 21 '22 02:09

Mark Tozzi

Asked by Arnkrishn. Answered by Romain and Mark Tozzi.