I have written a stochastic simulation in Java which loads data from a few CSV files on disk (totaling about 100 MB) and writes its results to another output file (not much data, just a boolean and a few numbers). There is also a parameters file, and for different parameters the distribution of simulation outputs would be expected to change. To determine the correct/best input parameters I need to run multiple simulations across multiple input parameter configurations and look at the distribution of the outputs in each group. Each simulation takes 0.1–10 minutes depending on the parameters and randomness.
I've been reading about Hadoop and wondering if it can help me run lots of simulations; I may have access to about 8 networked desktop machines in the near future. If I understand correctly, the map function could run my simulation and spit out the result, and the reducer might just be the identity.
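To make that concrete, here is a minimal sketch of the mapper I have in mind, assuming each line of the parameters file holds one configuration and Simulation.run(...) is a placeholder for my existing simulation code:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * One map() call per line of the parameters file. Simulation.run(...) is a
 * hypothetical stand-in for the existing Java simulation code.
 */
public class SimulationMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text paramLine, Context context)
            throws IOException, InterruptedException {
        // Run one simulation for this parameter configuration.
        String result = Simulation.run(paramLine.toString());

        // Emit the configuration as the key so every output line is tagged
        // with the configuration that produced it; the output distributions
        // can then be compared per configuration afterwards.
        context.write(paramLine, new Text(result));
    }
}
```

Repeating a configuration line N times in the parameters file would presumably give me N independent runs of that configuration.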
The thing I'm worried about is HDFS, which seems to be meant for huge files rather than a smattering of small CSV files, none of which would be big enough to even make up the minimum recommended block size of 64 MB. Furthermore, each simulation would only need an identical copy of each of the CSV files.
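For what it's worth, here is a rough driver sketch of how I imagine working around the small-file issue: the shared CSVs go through the distributed cache (so every node gets a local copy) rather than being treated as regular HDFS inputs, and NLineInputFormat turns each parameters line into its own map task. Paths and file names here are made up:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SimulationDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "simulation parameter sweep");
        job.setJarByClass(SimulationDriver.class);
        job.setMapperClass(SimulationMapper.class);
        job.setNumReduceTasks(0);            // no reduce step: map output is the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Ship the shared CSV inputs to every node via the distributed cache;
        // the "#name" fragment creates a local symlink in each task's working
        // directory, so the simulation can open them as ordinary local files.
        // (Paths are hypothetical.)
        job.addCacheFile(new URI("hdfs:///sim/data/table1.csv#table1.csv"));
        job.addCacheFile(new URI("hdfs:///sim/data/table2.csv#table2.csv"));

        // One map task per line of the parameters file, so each simulation run
        // gets its own task regardless of the 64 MB block size.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);
        NLineInputFormat.addInputPath(job, new Path(args[0]));   // parameters file
        TextOutputFormat.setOutputPath(job, new Path(args[1]));  // results directory

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```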
Is Hadoop the wrong tool for me?
I see a number of answers here that basically say, "no, you shouldn't use Hadoop for simulations because it wasn't built for simulations." I believe this is a rather short-sighted view and would be akin to someone saying in 1985, "you can't use a PC for word processing, PCs are for spreadsheets!"
Hadoop is a fantastic framework for building a simulation engine. I've been using it for this purpose for months and have had great success with small-data / large-computation problems. Here are the top five reasons I migrated to Hadoop for simulation (using R as my language for simulations, by the way):