
Using Amazon MapReduce/Hadoop for Image Processing

I have a project that requires me to process a lot (1,000 to 10,000) of large (100 MB to 500 MB) images. The processing I am doing can be done via ImageMagick, but I was hoping to actually do this processing on Amazon's Elastic MapReduce platform (which I believe runs on Hadoop).

All of the examples I have found deal with text-based inputs (I have seen the Word Count sample a billion times). I cannot find anything about this kind of work with Hadoop: starting with a set of files, performing the same action on each of the files, and then writing out each new file's output as its own file.

I am pretty sure this can be done with this platform, and should be able to be done using Bash; I don't think I need to go to the trouble of creating a whole Java app or something, but I could be wrong.

I'm not asking for someone to hand me code, but if anyone has sample code or links to tutorials dealing with similar issues, it would be much appreciated...

Asked Dec 13 '22 by jzimmerman2011


1 Answer

There are several problems with your task.

Hadoop does not natively process images, as you've seen. But you can export all the file names and paths into a text file and run a map function over it, so calling ImageMagick on the files on local disk is not a big deal.
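For illustration, a Hadoop Streaming job could take that text file of paths as input and use a small shell script as the mapper. This is only a minimal sketch of the idea; it assumes ImageMagick's convert is installed on every task node, that the listed paths are reachable from the node (local disk or shared storage), and the resize option and output directory are placeholders:

    #!/usr/bin/env bash
    # mapper.sh -- minimal Hadoop Streaming mapper sketch.
    # Assumes the job input is a text file with one image path per line
    # and that ImageMagick is installed on every task node.
    # The output directory and the -resize option are placeholders.

    while read -r image_path; do
        # Skip blank lines
        [ -z "$image_path" ] && continue

        out="/tmp/$(basename "$image_path")"

        # Run the same ImageMagick command on every input image.
        # In a real job you would copy the result to S3 or HDFS afterwards.
        if convert "$image_path" -resize 50% "$out"; then
            printf '%s\t%s\n' "$image_path" "OK"
        else
            printf '%s\t%s\n' "$image_path" "FAILED"
        fi
    done

You would launch it with something like hadoop jar hadoop-streaming.jar -input paths.txt -output status -mapper mapper.sh -file mapper.sh (the file names here are placeholders). Note that the job output is only the OK/FAILED status lines; the converted images themselves never go through Hadoop.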

But how do you deal with data locality?

You can't run ImageMagick on files in HDFS (there is only the Java API, and the FUSE mount is not stable), and you can't predict the task scheduling. So, for example, a map task can be scheduled to a host where the image does not exist.

Sure, you can simply use a single machine and a single task, but then you get no improvement; you would just have a bunch of overhead.

Also, there is a memory problem when you shell out from a Java task. I wrote a blog post about it [1].

and should be able to be done using Bash

That is the next problem: you'd at least have to write the map task yourself. You need a ProcessBuilder to call ImageMagick with a specific path and function.

I cannot find anything about this kind of work with Hadoop: starting with a set of files, performing the same action on each of the files, and then writing out each new file's output as its own file.

Guess why? :D Hadoop is not the right thing for this task.

So basically, I would recommend manually splitting your images across multiple EC2 hosts and running a bash script over them. It is less stressful and faster. To parallelize on a single host, split your files into one folder per core and run the bash script over each folder, as sketched below. This should utilize your machine quite well, better than Hadoop ever could.
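Here is a minimal sketch of that per-core split; the core count, directory names, and convert options are assumptions, not values from the original post:

    #!/usr/bin/env bash
    # Sketch of the per-core split described above.
    # CORES, SRC, OUT, and the -resize option are placeholders.

    CORES=4
    SRC=/data/images
    OUT=/data/output
    mkdir -p "$OUT"

    # Distribute the images round-robin into one folder per core.
    i=0
    for f in "$SRC"/*; do
        [ -f "$f" ] || continue
        dir="$SRC/part$(( i % CORES ))"
        mkdir -p "$dir"
        mv "$f" "$dir/"
        i=$(( i + 1 ))
    done

    # Start one background worker per folder, then wait for all of them.
    for dir in "$SRC"/part*; do
        (
            for img in "$dir"/*; do
                convert "$img" -resize 50% "$OUT/$(basename "$img")"
            done
        ) &
    done
    wait

Run the same script on each EC2 host over that host's share of the images, and you get the same parallelism Hadoop would give you without any of the scheduling or locality problems.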

[1] http://codingwiththomas.blogspot.com/2011/07/dealing-with-outofmemoryerror-in-hadoop.html

Answered Dec 21 '22 by Thomas Jungblut