I'm trying to run a job on Elastic MapReduce (EMR) with a custom jar, processing about 1,000 files in a single directory. When I submit my job with the parameter s3n://bucketname/compressed/*.xml.gz, I get a "matched 0 files" error. If I pass the absolute path to a single file (e.g. s3n://bucketname/compressed/00001.xml.gz), it runs fine, but only that one file gets processed. I also tried passing just the directory name (s3n://bucketname/compressed/), hoping the files within it would be processed, but that just passes the directory itself to the job.
At the same time, I have a smaller local Hadoop installation. There, when I submit my job with the same kind of wildcard (/path/to/dir/on/hdfs/*.xml.gz), it works fine and all 1,000 files are listed correctly.
How do I get EMR to list all my files?
The input data that needs to be processed using MapReduce is stored in HDFS. The processing can be done on a single file or a directory that has multiple files.
Previously, Amazon EMR used the s3n and s3a file systems. While both still work, we recommend that you use the s3 URI scheme for the best performance, security, and reliability. The local file system refers to a locally connected disk.
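For instance, if the input path is set inside the driver rather than passed on the command line, the same prefix from the question can be referenced with the recommended s3 scheme. This is only an illustrative sketch; the class name and the rest of the job wiring are assumptions, not part of the question:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class S3InputExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "s3 input example");
        // Same bucket/prefix as in the question, but using the recommended s3 scheme.
        FileInputFormat.addInputPath(job, new Path("s3://bucketname/compressed/"));
        // ... mapper, reducer, output path, and job submission would follow ...
    }
}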
Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
I don't know how EMR lists all the files, but here's a piece of code which works for me:
// Needs: java.net.URI, org.apache.hadoop.fs.{FileSystem, FileStatus, Path},
// and org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
FileSystem fs = FileSystem.get(URI.create(args[0]), job.getConfiguration());
// List everything under the input directory and add each file as an input path.
FileStatus[] files = fs.listStatus(new Path(args[0]));
for (FileStatus sfs : files) {
    FileInputFormat.addInputPath(job, sfs.getPath());
}
This lists every file in the input directory and adds each one as an input path, so you can then process them however you like.
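If you would rather keep the wildcard in the job argument instead of listing a whole directory, a similar approach is to expand the glob yourself with FileSystem.globStatus. The driver below is only a sketch; the class name and the surrounding job wiring are assumptions, not part of the original answer:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class GlobInputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "glob input example");

        // args[0] is a pattern such as s3n://bucketname/compressed/*.xml.gz
        FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
        FileStatus[] matches = fs.globStatus(new Path(args[0]));

        if (matches == null || matches.length == 0) {
            System.err.println("Pattern matched 0 files: " + args[0]);
            System.exit(1);
        }
        // Add every matched file as an input path before submitting the job.
        for (FileStatus status : matches) {
            FileInputFormat.addInputPath(job, status.getPath());
        }
        // ... set mapper, reducer, and output path here, then job.waitForCompletion(true) ...
    }
}
Expanding the glob inside the driver side-steps whatever the job-submission tooling does with the wildcard, since the matching then happens in your own code against the target file system.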