I'm trying to run a job on Elastic MapReduce (EMR) with a custom jar, processing about 1,000 files in a single directory. When I submit my job with the parameter s3n://bucketname/compressed/*.xml.gz, I get a "matched 0 files" error. If I pass the absolute path to a single file (e.g. s3n://bucketname/compressed/00001.xml.gz), it runs fine, but only that one file gets processed. I also tried passing just the directory name (s3n://bucketname/compressed/), hoping the files within it would be processed, but that just passes the directory itself to the job.
At the same time, I have a smaller local Hadoop installation. There, when I submit my job with the same kind of wildcard (/path/to/dir/on/hdfs/*.xml.gz), it works fine and all 1,000 files are listed correctly.
How do I get EMR to list all my files?
The input data that needs to be processed using MapReduce is stored in HDFS. The processing can be done on a single file or a directory that has multiple files.
Previously, Amazon EMR used the s3n and s3a file systems. While both still work, we recommend that you use the s3 URI scheme for the best performance, security, and reliability. The local file system refers to a locally connected disk.
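For instance, if the input path is set inside the driver rather than passed on the command line, the same prefix from the question can be referenced with the recommended s3 scheme. This is only an illustrative sketch; the class name and the rest of the job wiring are assumptions, not part of the question:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class S3InputExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "s3 input example");
        // Same bucket/prefix as in the question, but using the recommended s3 scheme.
        FileInputFormat.addInputPath(job, new Path("s3://bucketname/compressed/"));
        // ... mapper, reducer, output path, and job submission would follow ...
    }
}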
Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
I don't know how EMR lists all the files, but here's a piece of code which works for me:
// Needs: java.net.URI, org.apache.hadoop.fs.{FileSystem, FileStatus, Path},
// and org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
FileSystem fs = FileSystem.get(URI.create(args[0]), job.getConfiguration());
// List everything under the input directory and add each file as an input path.
FileStatus[] files = fs.listStatus(new Path(args[0]));
for (FileStatus sfs : files) {
    FileInputFormat.addInputPath(job, sfs.getPath());
}
This lists every file in the input directory and adds each one as an input path, so you can then process them however you like.
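If you would rather keep the wildcard in the job argument instead of listing a whole directory, a similar approach is to expand the glob yourself with FileSystem.globStatus. The driver below is only a sketch; the class name and the surrounding job wiring are assumptions, not part of the original answer:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class GlobInputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "glob input example");

        // args[0] is a pattern such as s3n://bucketname/compressed/*.xml.gz
        FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
        FileStatus[] matches = fs.globStatus(new Path(args[0]));

        if (matches == null || matches.length == 0) {
            System.err.println("Pattern matched 0 files: " + args[0]);
            System.exit(1);
        }
        // Add every matched file as an input path before submitting the job.
        for (FileStatus status : matches) {
            FileInputFormat.addInputPath(job, status.getPath());
        }
        // ... set mapper, reducer, and output path here, then job.waitForCompletion(true) ...
    }
}
Expanding the glob inside the driver side-steps whatever the job-submission tooling does with the wildcard, since the matching then happens in your own code against the target file system.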