
Multiple files as input on Amazon Elastic MapReduce

I'm trying to run a job on Elastic MapReduce (EMR) with a custom jar. The job needs to process about 1000 files in a single directory. When I submit the job with the parameter s3n://bucketname/compressed/*.xml.gz, I get a "matched 0 files" error. If I pass the absolute path to a single file (e.g. s3n://bucketname/compressed/00001.xml.gz), it runs fine, but only that one file gets processed. I also tried passing the directory name (s3n://bucketname/compressed/), hoping that the files within it would be processed, but that just passes the directory itself to the job.

I also have a smaller local Hadoop installation. There, when I submit the job with a wildcard (/path/to/dir/on/hdfs/*.xml.gz), it works fine and all 1000 files are listed correctly.

How do I get EMR to list all my files?

Shashank Agarwal asked Jul 20 '11 15:07



1 Answer

I don't know how EMR lists all the files, but here's a piece of code which works for me:

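        // args[0] is the input directory, e.g. s3n://bucketname/compressed/
        FileSystem fs = FileSystem.get(URI.create(args[0]), job.getConfiguration());
        // List the directory's entries and add each one as a separate input path.
        FileStatus[] files = fs.listStatus(new Path(args[0]));
        for (FileStatus sfs : files) {
            FileInputFormat.addInputPath(job, sfs.getPath());
        }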

This lists all the files in the input directory and adds each one as an input path. Inside the loop you can do whatever you like with each entry.
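
Because the loop above adds every entry in the directory, you may want to keep only the names that match the pattern from the question (*.xml.gz). The following is a minimal sketch of that filtering step in plain Java, without any Hadoop dependency. The class InputNameFilter and the method filterByGlob are hypothetical names introduced for illustration. In the Hadoop loop, the name to test would presumably come from sfs.getPath().getName().

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of Hadoop or EMR.
public class InputNameFilter {

    // Returns only the file names that match the given glob, e.g. "*.xml.gz".
    static List<String> filterByGlob(List<String> fileNames, String glob) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matched = new ArrayList<>();
        for (String name : fileNames) {
            // Match against the bare file name, not a full path.
            if (matcher.matches(Paths.get(name))) {
                matched.add(name);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Example names as they might appear in a directory listing (assumed).
        List<String> names = Arrays.asList("00001.xml.gz", "00002.xml.gz", "_SUCCESS", "notes.txt");
        System.out.println(filterByGlob(names, "*.xml.gz"));
    }
}
```

With a check like this inside the loop, only the matching entries would be passed to FileInputFormat.addInputPath.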

Arsen Zahray answered Nov 13 '22 23:11