 

Processing images using hadoop

I'm new to Hadoop and I'm going to develop an application that processes multiple images using Hadoop and shows users the results live, while the computation is in progress. The basic approach is to distribute the executable and a bunch of images, then gather the results.

Can I get results interactively while the computing process is in progress?

Are there any alternatives to Hadoop Streaming for such a use case?

How can I feed the executable with images? I can't find any examples other than feeding it via stdin.

asked Apr 14 '14 10:04 by remdezx


1 Answer

For processing images on Hadoop, the best way to organize the computation would be:

  1. Store images in a sequence file. Key: the image name or its ID; value: the image binary data. This way you will have a single file with all the images you need to process. If images are added to your system dynamically, consider aggregating them into daily sequence files. I don't think you should use any compression for this sequence file, as general-purpose compression algorithms do not work well with images.
  2. Process the images. Here you have a number of options. The first is Hadoop MapReduce with a program written in Java: with Java you can read the sequence file and directly obtain the "value" on each map step, where the value is the binary image data, and run any processing logic on it. The second option is Hadoop Streaming. It has the limitation that all the data goes to the stdin of your application and the result is read from stdout, but you can overcome this by writing your own InputFormat in Java that serializes the image binary data from the sequence file as a Base64 string and passes it to your generic application. The third option is Spark, but again your choice of programming languages is limited: Scala, Java, or Python.
  3. Hadoop was developed to simplify batch processing over large amounts of data, and Spark is essentially similar: it is a batch tool. This means you cannot get any results before all the data is processed. Spark Streaming is a bit different: there you work with micro-batches of 1-10 seconds and process each of them separately, so in general you can make it work for your case.
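To make the Base64 trick in option 2 concrete, here is a minimal Python sketch of what a record could look like once a custom InputFormat has serialized an image for Hadoop Streaming. The record layout (tab-separated key and Base64 value) and the function names are my own illustration, not a Hadoop API:

```python
import base64

def encode_record(image_name, image_bytes):
    # Mimic what a custom InputFormat could emit for Hadoop Streaming:
    # key (image name) and Base64-encoded binary value, tab-separated,
    # so the record survives stdin/stdout as a single text line.
    value = base64.b64encode(image_bytes).decode("ascii")
    return f"{image_name}\t{value}"

def decode_record(line):
    # Inside the streaming mapper: split the line back into the key
    # and the original image bytes.
    name, value = line.rstrip("\n").split("\t", 1)
    return name, base64.b64decode(value)

record = encode_record("cat.jpg", b"\x89PNG\r\n\x1a\n")
name, data = decode_record(record)
```

Your mapper, written in any language, would read such lines from stdin, decode the value, process the image, and write results to stdout.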

I don't know your complete use case, but one possible solution is Kafka + Spark Streaming: your application puts the images in binary format into a Kafka queue, while Spark consumes and processes them in micro-batches on the cluster, updating the users through some third component (at the very least by putting the image-processing status back into Kafka for another application to pick up).
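The key idea that makes live results possible is micro-batching. The toy Python sketch below is not the Kafka or Spark Streaming API; it only illustrates the pattern of consuming a stream in small fixed-size batches and publishing a status update after each one, so users see progress before the whole stream is done:

```python
from collections import deque

def micro_batches(stream, batch_size):
    # Toy illustration of the micro-batch pattern used by Spark Streaming:
    # consume items in small fixed-size groups so partial results can be
    # reported before the whole stream has been processed.
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final, possibly smaller, batch

incoming = deque(f"img_{i}.jpg" for i in range(7))
statuses = []
for batch in micro_batches(incoming, 3):
    # process the batch here, then publish a status update
    # (e.g. back to a Kafka topic for the UI to consume)
    statuses.append(f"processed {len(batch)} images")
```

In a real deployment, Spark Streaming does this batching for you on the cluster, and the status updates would go to Kafka or another messaging layer instead of a local list.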

But in general, the information you provided is not complete enough to recommend a good architecture for your specific case.

answered Sep 24 '22 23:09 by 0x0FFF