I have a project that requires me to process a lot (1000-10000) of big (100 MB to 500 MB) images. The processing I am doing can be done via ImageMagick, but I was hoping to actually do this processing on Amazon's Elastic MapReduce platform (which I believe runs using Hadoop).
All of the examples I have found deal with text-based inputs (I have found the Word Count sample a billion times). I cannot find anything about this kind of work with Hadoop: starting with a set of files, performing the same action on each of the files, and then writing out each new file's output as its own file.
I am pretty sure this can be done with this platform, and should be able to be done using Bash; I don't think I need to go to the trouble of creating a whole Java app or something, but I could be wrong.
I'm not asking for someone to hand me code, but if anyone has sample code or links to tutorials dealing with similar issues, it would be much appreciated...
There are several problems with your task.
As you have seen, Hadoop does not natively process images. But you can export all the file names and paths into a text file and run a map function over it, so calling ImageMagick on files that sit on the local disk is not a big deal.
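For illustration, building that text file of paths could look roughly like the sketch below; the directory and output file name are just placeholders I made up:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Sketch: write one image path per line into a text file that can then be
// fed to the job as its (text) input. Directory and file names are placeholders.
public class ImageListWriter {
  public static void main(String[] args) throws IOException {
    Path imageDir = Paths.get("/data/images");           // placeholder input directory
    try (Stream<Path> files = Files.walk(imageDir);
         PrintWriter out = new PrintWriter("image-paths.txt")) {
      files.filter(Files::isRegularFile)
           .forEach(p -> out.println(p.toAbsolutePath())); // one absolute path per line
    }
  }
}
```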
But how do you deal with data locality?
You can't run ImageMagick directly on files in HDFS (they are only reachable through the Java API, and the FUSE mount is not stable), and you can't predict the task scheduling. So, for example, a map task can be scheduled to a host where the image does not exist.
Sure, you can simply use just a single machine and a single task. But then you get no improvement; you just have a bunch of overhead.
There is also a memory problem when you shell out from a Java task; I wrote a blog post about it [1].
and should be able to be done using Bash
That is the next problem: you'd have to write at least the map task. You need a ProcessBuilder to call ImageMagick with a specific path and operation.
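Roughly like the following sketch, assuming the job input is a plain text file with one local image path per line, `convert` is on the PATH of the task's host, and the class name, output naming and resize arguments are made up:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a map task that shells out to ImageMagick for each input line.
// The input is a text file containing one local image path per line; the
// output path is derived from the input path. Error handling is minimal.
public class ImageMagickMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String inputPath = value.toString().trim();
    String outputPath = inputPath + ".processed.png"; // placeholder naming scheme

    // Call ImageMagick's convert; "-resize 50%" is only an example operation.
    ProcessBuilder pb = new ProcessBuilder(
        "convert", inputPath, "-resize", "50%", outputPath);
    pb.inheritIO(); // forward ImageMagick's stdout/stderr to the task logs

    Process process = pb.start();
    int exitCode = process.waitFor();
    if (exitCode != 0) {
      throw new IOException("convert failed for " + inputPath
          + " with exit code " + exitCode);
    }

    // Emit the processed path so the job produces some record of its work.
    context.write(new Text(inputPath), NullWritable.get());
  }
}
```

Note that this still does not solve the locality problem above: the path handed to the task only helps if the image actually sits on that host.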
I cannot find anything about this kind of work with Hadoop: starting with a set of files, performing the same action on each of the files, and then writing out each new file's output as its own file.
Guess why? :D Hadoop is not the right thing for this task.
So basically I would recommend manually splitting your images across multiple EC2 hosts and running a bash script over them. That is less stress and it is faster. To parallelize on a single host, split your files into one folder per core and run the bash script over each folder. This should utilize your machines quite well, and better than Hadoop ever could.
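If you'd rather stay in one language than juggle per-folder bash scripts, the same per-core fan-out can be sketched with a plain Java thread pool; the directory and the `convert` arguments are placeholders, and a simple bash loop per folder works just as well:

```java
import java.io.File;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: process every image in a directory with ImageMagick, using one
// worker thread per CPU core instead of one bash script per folder.
public class LocalImageProcessor {

  public static void main(String[] args) throws InterruptedException {
    File inputDir = new File("/data/images");   // placeholder input directory
    File[] images = inputDir.listFiles();
    if (images == null) {
      System.err.println("Not a directory: " + inputDir);
      return;
    }

    int cores = Runtime.getRuntime().availableProcessors();
    ExecutorService pool = Executors.newFixedThreadPool(cores);

    for (File image : images) {
      pool.submit(() -> {
        try {
          // "-resize 50%" is only an example of an ImageMagick operation.
          Process p = new ProcessBuilder("convert", image.getPath(),
              "-resize", "50%", image.getPath() + ".processed.png")
              .inheritIO()
              .start();
          p.waitFor();
        } catch (IOException | InterruptedException e) {
          System.err.println("Failed on " + image + ": " + e);
        }
      });
    }

    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.DAYS); // wait until everything is done
  }
}
```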
[1] http://codingwiththomas.blogspot.com/2011/07/dealing-with-outofmemoryerror-in-hadoop.html