
Should I prefer Hadoop or Condor when working with R?

Tags:

r

hadoop

condor

I am looking for ways to send work to multiple computers on my university's computer grid.

The grid currently runs Condor and also offers Hadoop.

My question is: should I try to interface R with Hadoop or with Condor for my projects?

For the discussion, let's assume we are talking about embarrassingly parallel tasks.
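(For concreteness, here is a minimal sketch of the kind of embarrassingly parallel workload I mean, using only base R's `parallel` package; the simulation itself is a made-up stand-in.)

```r
library(parallel)

# Each replicate is fully independent of the others -- no communication needed.
simulate_once <- function(i) {
  set.seed(i)
  mean(rnorm(1e4))  # hypothetical per-task workload
}

cl <- makeCluster(4)                        # 4 local worker processes
results <- parLapply(cl, 1:100, simulate_once)
stopCluster(cl)

length(results)  # 100 independent results
```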

p.s: I've seen the resources described in the CRAN task views.

Tal Galili asked Nov 04 '10 10:11


1 Answer

You can do both.

You can use HDFS for your data sets and Condor for your job scheduling: use Condor to place executors on machines, and HDFS plus Hadoop's Map-Reduce features to process your data (assuming your problem is map-reduce mappable). That way you're using the most appropriate tool for each job: Condor is a job scheduler, and as such does that work better than Hadoop, while HDFS and the M-R framework are things Condor doesn't have (but which are really helpful for jobs running on Condor to use).
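To make the Condor side concrete, a submit description along these lines would fan an R script out over the grid, one job per shard (the file name `simulate.R` and the shard count are hypothetical; adapt paths to your site's setup):

```
universe   = vanilla
executable = /usr/bin/Rscript
arguments  = simulate.R $(Process)
output     = out.$(Process)
error      = err.$(Process)
log        = sim.log
queue 100
```

`condor_submit` expands `$(Process)` to 0..99, so each job receives its own shard index as a command-line argument.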

I would personally look at using HDFS to share data among jobs that run discretely as Condor jobs. Especially in a university environment, where shared compute resources are not 100% reliable and can come and go at will, Condor's resilience in this type of setup is going to make getting work done a whole lot easier.
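A sketch of the per-job R script this implies (everything here is hypothetical: the script reads its shard index from the command line, does its independent piece of work, and writes a small result file; with HDFS mounted, or via a `hadoop fs -put` wrapper, those result files could live in HDFS for other jobs to pick up):

```r
# simulate.R -- one Condor job runs one shard.
# The shard index arrives as the first command-line argument
# (Condor's $(Process) macro); default to 0 for interactive runs.
args  <- commandArgs(trailingOnly = TRUE)
shard <- if (length(args) >= 1) as.integer(args[1]) else 0L

set.seed(shard)
result <- mean(rnorm(1e4))  # stand-in for the real per-shard work

# One small result file per shard, e.g. result_0000.txt.
outfile <- sprintf("result_%04d.txt", shard)
writeLines(format(result), outfile)
```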

Ian C. answered Sep 28 '22 07:09