I am new to Hadoop. I want to cluster ~150 million items with each of them having ~30 attributes using Hierarchical Clustering. Total number of dimensions/attributes is ~5000.
I have designed a multi-level solution: partition the whole data set, perform clustering on each partition, and then merge the resulting clusters until the desired number of clusters is reached.
- Clustering is performed in each map task, so each map task would be CPU-intensive.
- I am stuck at deciding about which of the following options to use:
- Map-Reduce in native Java.
- Map-Reduce using Hadoop Streaming with C (since each task is CPU-intensive).
Which option should I go with? Is there any other way I could achieve my goal?
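For context, here is a rough skeleton of what I have in mind for the per-partition clustering mapper (class and method names are placeholders, the clustering itself is stubbed out, and it assumes one partition fits in a map task's heap):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: buffer one partition's items, cluster them in cleanup(),
// and emit cluster representatives for a later merge step in the reducer.
public class PartitionClusterMapper
        extends Mapper<LongWritable, Text, IntWritable, Text> {

    private final List<double[]> partition = new ArrayList<>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // One input line = one item, e.g. comma-separated attribute values.
        String[] fields = value.toString().split(",");
        double[] item = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            item[i] = Double.parseDouble(fields[i]);
        }
        partition.add(item);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // CPU-intensive part: hierarchical clustering on the buffered partition.
        List<double[]> representatives = clusterPartition(partition);
        for (double[] rep : representatives) {
            StringBuilder sb = new StringBuilder();
            for (double d : rep) {
                sb.append(d).append(',');
            }
            // All representatives go to one key so a reducer can merge them further.
            context.write(new IntWritable(1), new Text(sb.toString()));
        }
    }

    private List<double[]> clusterPartition(List<double[]> items) {
        // Placeholder: the real hierarchical clustering would go here.
        return new ArrayList<>(items);
    }
}
```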
In many cases, Java (when well written) will yield similar performance to C unless the C code is carefully optimized. In surprisingly many cases, well-written Java code does outperform C code, because the C code is optimized at compilation time, whereas the Java hotspot compiler optimizes at runtime (where it has statistics available on how often each codepath is used).
If you gathered similar statistics, and they do not change depending on your data, you can sometimes give hints to the C compiler, e.g. by using `__builtin_expect()`, available in some C compilers. But that is really hard to do.
Bear in mind, however, that some parts of Java come at a quite high price:

- `Collection<Integer>`, `ArrayList<Double>` etc. for computation, because of boxing cost. These are really expensive in hot loops.
- `BufferedReader`. There is a reason why Hadoop uses `Text` instead of `String`: buffer recycling reduces I/O cost.

For Hadoop, bear in mind that Hadoop Streaming isn't free. In case you haven't realized: Hadoop Streaming itself is implemented in Java. All data will pass through Java. Hadoop Streaming is a Java application that launches your script application, writes data to it (i.e. serializes the data!), and reads back the output (i.e. deserializes the data!). You pretty much get all of the Java cost on top of your actual program's cost: Hadoop Streaming is a mapper written in Java that passes the data to an external program, reads back the answer, and returns it to Hadoop. Benchmark something as simple as a word count written in C against an optimized word count in Java to see the difference.
For your actual task of doing HAC: first make sure you have a working similarity measure. There is nothing worse than building a large-scale clustering algorithm only to find out it does not work because you cannot measure similarity in a meaningful way. First solve the problem on a small sample, then scale up.
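For example, if your items are sparse (~30 non-zero attributes out of ~5000 dimensions), cosine similarity over a sorted-index sparse representation is one candidate you could validate on a small sample first. A rough sketch (both the representation and the choice of cosine are assumptions, not requirements):

```java
// Sparse vector: sorted dimension indices plus matching values
// (primitive arrays on purpose -- no boxing in the hot loop).
public final class SparseCosine {

    public static double cosine(int[] idxA, double[] valA, int[] idxB, double[] valB) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        int i = 0, j = 0;
        // Merge-style walk over the two sorted index lists for the dot product.
        while (i < idxA.length && j < idxB.length) {
            if (idxA[i] == idxB[j]) {
                dot += valA[i] * valB[j];
                i++;
                j++;
            } else if (idxA[i] < idxB[j]) {
                i++;
            } else {
                j++;
            }
        }
        for (double v : valA) normA += v * v;
        for (double v : valB) normB += v * v;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```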
If it really matters, you would have to profile each of them. There is no way of telling upfront.
My intuition is that a straightforward Java implementation will perform similarly to native C unless you start hand-optimizing the latter.
Keep in mind that MapReduce jobs oftentimes have high I/O times, in particular when reading from text files. So doing a few hundred passes of KMeans or computing an SVD may not be that expensive after all, and you will want to measure that aspect as well.
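One way to get a rough compute-vs-I/O split, assuming a native Java mapper, is to time the per-record work and report it through a Hadoop counter (the counter group and name here are made up for illustration):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TimedMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        long start = System.nanoTime();

        // ... the actual CPU-intensive work on this record would go here ...

        long elapsed = System.nanoTime() - start;
        // Time not accounted for here (record reading, shuffle, spills) is I/O and framework overhead.
        context.getCounter("PROFILING", "COMPUTE_NANOS").increment(elapsed);
    }
}
```

Summing this counter over all tasks and comparing it against the tasks' wall-clock time gives a first estimate of how much of the job is actually CPU-bound.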