Using Hadoop for Parallel Processing rather than Big Data

I manage a small team of developers, and at any given time we have several ongoing (one-off) data projects that could be considered "embarrassingly parallel". These generally involve running a single script on a single computer for several days; a classic example would be processing several thousand PDF files to extract some key text and place it into a CSV file for later insertion into a database.

We are now doing enough of this type of task that I started to investigate developing a simple job queue system using RabbitMQ with a few spare servers (with an eye to using Amazon SQS/S3/EC2 for projects that need larger scaling).

In searching for examples of others doing this, I keep coming across the classic Hadoop New York Times example:

The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth)

Which sounds perfect, so I researched Hadoop and Map/Reduce.

But what I can't work out is how they did it, or why they did it.

Surely converting TIFFs into PDFs is not a Map/Reduce problem? Wouldn't a simple job queue have been better?

The other classic Hadoop example, the "wordcount" from the Yahoo Hadoop Tutorial, seems a perfect fit for Map/Reduce, and I can see why it is such a powerful tool for Big Data.

What I don't understand is how these "embarrassingly parallel" tasks are put into the Map/Reduce pattern.

TL;DR

This is very much a conceptual question: basically, I want to know how I would fit a task of "processing several thousand PDF files to extract some key text and place it into a CSV file" into a Map/Reduce pattern.

If you know of any examples, that would be perfect; I'm not asking you to write it for me.

(Note: we already have code to process the PDFs; I'm not asking for that - it's just an example, and it could be any task. I'm asking about putting processes like that into the Hadoop Map/Reduce pattern when there are no clear "Map" or "Reduce" elements to a task.)

Cheers!

asked Apr 01 '13 by Snowpoch



1 Answer

Your thinking is right.

The examples you mentioned use only part of the solution that Hadoop offers. They definitely use the parallel computing ability of Hadoop plus the distributed file system. It's not necessary that you will always need a reduce step; you may not have any data interdependency between the parallel processes that are run, in which case you can eliminate the reduce step.

I think your problem will also fit into the Hadoop solution domain.

You have huge data (a huge number of PDF files) and a long-running job.

You can process these files in parallel by placing them on HDFS and running a MapReduce job. Your processing time theoretically improves with the number of nodes you have in your cluster. If you do not see the need to aggregate the data sets produced by the individual tasks, you do not need a reduce step; otherwise you need to design a reduce step as well.
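To make that concrete, here is a minimal sketch (not from the original answer) of what the map side could look like for the PDF task. It assumes the job's input is a plain text file listing one PDF path per line, and extractKeyText() is a hypothetical placeholder for the extraction code you already have:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only mapper: one input line = one PDF path, one output line = one CSV row.
public class PdfToCsvMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String pdfPath = line.toString().trim();
        if (pdfPath.isEmpty()) {
            return; // skip blank lines in the path list
        }
        String csvRow = extractKeyText(pdfPath); // your existing PDF-processing logic goes here
        context.write(NullWritable.get(), new Text(csvRow));
    }

    // Hypothetical stand-in for the extraction code the question says already exists.
    private String extractKeyText(String pdfPath) {
        return pdfPath + ",extracted-field-1,extracted-field-2";
    }
}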

The thing here is that if you do not need a reduce step, you are just leveraging the parallel computing ability of Hadoop, plus you are equipped to run your jobs on not-so-expensive hardware.
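A rough sketch of the matching driver, again just illustrative (class and path names are assumptions): setting the number of reduce tasks to zero is what turns this into a map-only job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PdfToCsvJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pdf-to-csv");
        job.setJarByClass(PdfToCsvJob.class);
        job.setMapperClass(PdfToCsvMapper.class);
        job.setNumReduceTasks(0);                  // no reduce step: mapper output is written directly
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // text file(s) listing PDF paths
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // CSV rows land here as part-m-* files
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With zero reducers, each map task writes its own part-m-NNNNN file under the output directory; those files can be concatenated afterwards into the single CSV you want.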

answered Oct 28 '22 by Rags