Hadoop

Question

Is Hadoop a proper solution for jobs that are CPU intensive and need to process a small file of around 500 MB? I have read that Hadoop is aimed to process the so called Big Data, and I wonder how it performs with a small amount of data (but a CPU intensive workload).

I would mainly like to know if a better approach for this scenario exists or instead I should stick to Hadoop.

merours · Accepted Answer

Hadoop is a distributed computing framework proposing a MapReduce engine. If you can express your parallelizable cpu intensive application with this paradigm (or any other supported by Hadoop modules), you may take advantage of Hadoop. A classical example of Hadoop computations is the calculation of Pi, which doesn't need any input data. As you'll see here, yahoo managed to determine the two quadrillonth digit of pi thanks to Hadoop.

However, Hadoop is indeed specialized for Big Data in the sense that it was developped for this aim. For instance, you dispose of a file system designed to contain huge files. These huge files are chunked into a lot of blocks accross a large number of nodes. In order to ensure your data integrity, each block has to be replicated to other nodes.

To conclude, I'd say that if you already dispose of an Hadoop cluster, you may want to take advantage of it. If that's not the case, and while I can't recommand anything since I have no idea what exactly is your need, I think you can find more light weights frameworks than Hadoop.

samthebest · Answer

Well a lot of companies are moving to Spark, and I personally believe it's the future of parallel processing.

It sounds like what you want to do is use many CPUs possibly on many nodes. For this you should use a Scalable Language especially designed for this problem - in other words Scala. Using Scala with Spark is much much easier and much much faster than hadoop.

If you don't have access to a cluster, it can be an idea to use Spark anyway so that you can use it in future more easily. Or just use .par in Scala and that will paralellalize your code and use all the CPUs on your local machine.

Finally Hadoop is indeed intended for Big Data, whereas Spark is really just a very general MPP framework.

Hadoop - CPU intensive application - Small data

Tags:

performance

cpu

Mikel Urkia

2 Answers

merours

samthebest

Recent Activity

Donate For Us