
Is Hadoop right for running my simulations?

I have written a stochastic simulation in Java which loads data from a few CSV files on disk (totaling about 100MB) and writes results to another output file (not much data, just a boolean and a few numbers). There is also a parameters file, and for different parameters the distribution of simulation outputs would be expected to change. To determine the correct/best input parameters I need to run multiple simulations, across multiple input parameter configurations, and look at the distributions of the outputs in each group. Each simulation takes 0.1-10 minutes depending on parameters and randomness.

I've been reading about Hadoop and wondering if it can help me run lots of simulations; I may have access to about 8 networked desktop machines in the near future. If I understand correctly, the map function could run my simulation and spit out the result, and the reducer might be the identity.
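
To make that concrete, here is a rough sketch of the kind of mapper I'm imagining, where each input record is one line of the parameters file and the reducer is left as the default identity (or dropped entirely). Simulation.run is just a stand-in for my existing simulation code, not a real class:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SimulationMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text paramLine, Context context)
                throws IOException, InterruptedException {
            // One input line = one parameter configuration = one simulation run.
            String params = paramLine.toString();

            // Simulation.run(...) is a placeholder for my existing Java simulation;
            // it loads the CSVs and returns the boolean plus the output numbers.
            String result = Simulation.run(params);

            // Keying by the parameter configuration groups results per configuration.
            context.write(new Text(params), new Text(result));
        }
    }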

The thing I'm worried about is HDFS, which seems to be meant for huge files rather than a smattering of small CSV files (none of which would be big enough to even make up the minimum recommended block size of 64MB). Furthermore, each simulation would only need an identical copy of each of the CSV files.
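
For what it's worth, the way I currently imagine wiring this up is to use the small parameters file as the actual job input (one line per map task) and push the shared CSVs to every node through the distributed cache, rather than treating them as HDFS input data. This is only a sketch based on my reading so far; the paths are made up, and the calls shown (Job.getInstance, addCacheFile, NLineInputFormat) are from recent Hadoop releases, so the details may well differ on the version I end up with:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SimulationDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "parameter-sweep");
            job.setJarByClass(SimulationDriver.class);

            // Ship the shared CSVs (about 100MB total) to every node; each task
            // reads its local copy instead of pulling the files out of HDFS splits.
            job.addCacheFile(new URI("/sim/table1.csv"));   // illustrative paths
            job.addCacheFile(new URI("/sim/table2.csv"));

            // The real input is the parameters file: one line per map task,
            // so each map task runs exactly one simulation.
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.addInputPath(job, new Path("/sim/parameters.txt"));
            NLineInputFormat.setNumLinesPerSplit(job, 1);

            job.setMapperClass(SimulationMapper.class);
            job.setNumReduceTasks(0);                 // map-only; no reduce step needed
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileOutputFormat.setOutputPath(job, new Path("/sim/results"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }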

Is Hadoop the wrong tool for me?

asked Oct 19 '09 by Pengin



1 Answer

I see a number of answers here that basically say, "no, you shouldn't use Hadoop for simulations because it wasn't built for simulations." I believe this is a rather short-sighted view, akin to someone saying in 1985, "you can't use a PC for word processing; PCs are for spreadsheets!"

Hadoop is a fantastic framework for building a simulation engine. I've been using it for this purpose for months and have had great success with small data / large computation problems. Here are the top 5 reasons I migrated to Hadoop for simulation (using R as my language for simulations, btw):

  1. Access: I can lease Hadoop clusters through Amazon Elastic MapReduce, so I don't have to invest any time and energy into the administration of a cluster. This meant I could actually start doing simulations on a distributed framework without having to get administrative approval in my org!
  2. Administration: Hadoop handles job control issues, like node failure, invisibly. I don't have to code for these conditions. If a node fails, Hadoop makes sure the sims scheduled for that node get run on another node.
  3. Upgradeable: Being a rather generic map reduce engine with a great distributed file system, Hadoop means that if you later have problems involving large data, you don't have to migrate to a new solution once you're used to it. So Hadoop gives you a simulation platform that will also scale to a large data platform for (nearly) free!
  4. Support: Being open source and used by so many companies, Hadoop has a wealth of resources, both online and off. Many of those resources are written with the assumption of "big data", but they are still useful for learning to think in a map reduce way.
  5. Portability: I have built analyses on top of proprietary engines using proprietary tools, which took considerable learning to get working. When I later changed jobs and found myself at a firm without that same proprietary stack, I had to learn a new set of tools and a new simulation stack. Never again. I traded in SAS for R and our old grid framework for Hadoop. Both are open source, and I know that I can land at any job in the future and immediately have tools at my fingertips to start kicking ass.
answered Sep 22 '22 by JD Long