I am trying to find the right hardware size for my Spark job. My understanding was that scaling up the number of machines would help speed up the job, since it has no complex action and therefore presumably does very little computation in the driver program. However, what I observe is that job execution slows down when I add resources to Spark. I can reproduce this effect with the following simple job:
Running a simple 'filter' transformation on the RDD, which looks like this:
JavaRDD<String> filteredRDD = rdd.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String s) {
        String filter = "FILTER_STRING";
        return s.indexOf(filter) > 0;
    }
});
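For completeness, a minimal self-contained version of the job looks roughly like the sketch below. The HDFS path, the application name, and the count() action are placeholders for illustration, not my actual input or action:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class FilterScalingTest {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("FilterScalingTest");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Placeholder HDFS input path; the real dataset is different
        JavaRDD<String> rdd = sc.textFile("hdfs:///data/input.txt");

        JavaRDD<String> filteredRDD = rdd.filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String s) {
                return s.indexOf("FILTER_STRING") > 0;
            }
        });

        // A simple action to force evaluation; the real job may use another action
        System.out.println("Matching lines: " + filteredRDD.count());

        sc.stop();
    }
}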
The scaling problem shows itself when I scale up the number of machines in the cluster from 4 to 8. Here are some details about the environment:
Any ideas why I am not getting the degree of scalability I expect from Spark?
Thanks to the many comments, I think I found what was wrong with my cluster. The idea that the HDFS 'replication factor' is at least part of the problem was a very good clue.
To test this, I changed the HDFS replication factor to the number of cluster nodes, re-ran the tests, and got scalable results. But I was not convinced this was the real reason, because Spark claims to take data locality into account when assigning partitions to executors, and even with the default replication level (3) it should have enough room to assign partitions evenly. With some more investigation I figured out that this may not be the case if YARN (or any other cluster manager) decides to place more than one executor on the same physical machine and leaves other machines unused. In that case some HDFS blocks are not local to any executor, which results in data transfer across the network and the scaling problem I observed.
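For anyone who wants to repeat the test, something like the sketch below can raise the replication factor of the input file and ask YARN for one executor per machine. The path, the replication value of 8, and the executor settings are example values for an 8-node setup, not my exact configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;

public class LocalityTestSetup {
    public static void main(String[] args) throws Exception {
        // Raise the replication factor of the test input so every node holds a copy.
        // The path and the value 8 are example values.
        Configuration hadoopConf = new Configuration();
        FileSystem fs = FileSystem.get(hadoopConf);
        fs.setReplication(new Path("/data/input.txt"), (short) 8);

        // Ask YARN for one executor per machine instead of packing several onto one node.
        // These numbers are examples; tune them to the cluster's cores and memory.
        SparkConf conf = new SparkConf()
                .setAppName("LocalityTest")
                .set("spark.executor.instances", "8")
                .set("spark.executor.cores", "4")
                .set("spark.executor.memory", "4g");
        // ... build the JavaSparkContext from 'conf' and run the filter job as before
    }
}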