Using Hadoop for Parallel Processing rather than Big Data

I manage a small team of developers, and at any given time we have several ongoing (one-off) data projects that could be considered "embarrassingly parallel". These generally involve running a single script on a single computer for several days; a classic example would be processing several thousand PDF files to extract some key text and place it into a CSV file for later insertion into a database.

We are now doing enough of this type of task that I started to investigate developing a simple job queue system using RabbitMQ with a few spare servers (with an eye to using Amazon SQS/S3/EC2 for projects that need larger scaling).

In searching for examples of others doing this, I keep coming across the classic Hadoop New York Times example:

The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth)

Which sounds perfect, so I researched Hadoop and Map/Reduce.

But what I can't work out is how they did it, or why they did it.

Surely converting TIFFs into PDFs is not a Map/Reduce problem? Wouldn't a simple job queue have been better?

The other classic Hadoop example, the "wordcount" from the Yahoo Hadoop Tutorial, seems a perfect fit for Map/Reduce, and I can see why it is such a powerful tool for Big Data.

What I don't understand is how these "embarrassingly parallel" tasks are put into the Map/Reduce pattern.

TL;DR

This is very much a conceptual question: basically, I want to know how I would fit a task of "processing several thousand PDF files to extract some key text and place it into a CSV file" into a Map/Reduce pattern.

If you know of any examples, that would be perfect; I'm not asking you to write it for me.

(Note: we already have code to process the PDFs; I'm not asking for that - it's just an example, and it could be any task. I'm asking about putting processes like that into the Hadoop Map/Reduce pattern when there are no clear "Map" or "Reduce" elements to a task.)

Cheers!

asked Apr 01 '13 by Snowpoch



1 Answer

Your thinking is right.

The examples you mentioned use only part of the solution that Hadoop offers. They definitely use the parallel computing ability of Hadoop plus the distributed file system. It's not necessary that you will always need a reduce step; you may not have any data interdependency between the parallel processes that are run, in which case you can eliminate the reduce step.

I think your problem will also fit into the Hadoop solution domain.

You have huge data (a huge number of PDF files) and a long-running job.

You can process these files in parallel by placing them on HDFS and running a MapReduce job. Your processing time theoretically improves with the number of nodes you have in your cluster. If you do not see the need to aggregate the data sets produced by the individual tasks, you do not need a reduce step; otherwise you need to design a reduce step as well.
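To make that concrete, here is a minimal sketch (not from the original answer) of what the map side could look like for the PDF task. It assumes the job's input is a plain text file listing one PDF path per line, and extractKeyText() is a hypothetical placeholder for the extraction code you already have:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only mapper: one input line = one PDF path, one output line = one CSV row.
public class PdfToCsvMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String pdfPath = line.toString().trim();
        if (pdfPath.isEmpty()) {
            return; // skip blank lines in the path list
        }
        String csvRow = extractKeyText(pdfPath); // your existing PDF-processing logic goes here
        context.write(NullWritable.get(), new Text(csvRow));
    }

    // Hypothetical stand-in for the extraction code the question says already exists.
    private String extractKeyText(String pdfPath) {
        return pdfPath + ",extracted-field-1,extracted-field-2";
    }
}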

The thing here is that if you do not need a reduce step, you are just leveraging the parallel computing ability of Hadoop, plus you are equipped to run your jobs on not-so-expensive hardware.
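A rough sketch of the matching driver, again just illustrative (class and path names are assumptions): setting the number of reduce tasks to zero is what turns this into a map-only job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PdfToCsvJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pdf-to-csv");
        job.setJarByClass(PdfToCsvJob.class);
        job.setMapperClass(PdfToCsvMapper.class);
        job.setNumReduceTasks(0);                  // no reduce step: mapper output is written directly
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // text file(s) listing PDF paths
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // CSV rows land here as part-m-* files
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With zero reducers, each map task writes its own part-m-NNNNN file under the output directory; those files can be concatenated afterwards into the single CSV you want.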

answered Oct 28 '22 by Rags