Spark performs slower with hardware scaling up

I am trying to find the right hardware size for my Spark job. My understanding was that scaling up the number of machines would help speed up the job, given that it has no complex action and therefore probably only a small amount of computation in the driver program. However, what I observe is that the job actually slows down when I add resources to Spark. I can reproduce the effect with the following simple job:

  • Loading a text file (~100Gb) from HDFS
  • Running a simple 'filter' transformation on the RDD, which looks like this:

    JavaRDD<String> filteredRDD = rdd.filter(new Function<String, Boolean>() {
        public Boolean call(String s) {
            String filter = "FILTER_STRING";
            return s.indexOf(filter) > 0;
        }
    });

  • Running count() action on the result
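As an aside, the filter predicate above can be checked without a Spark cluster at all. The small standalone sketch below (class and test strings are hypothetical, not from the original job) shows one subtlety worth knowing: `indexOf(filter) > 0` returns false for a line that *starts* with the filter string, because `indexOf` returns 0 for a match at position 0, so `s.contains(filter)` (or `>= 0`) may be what is actually intended:

```java
// Standalone illustration of the filter predicate -- no Spark needed.
public class FilterPredicateDemo {

    // The predicate as written in the question: misses matches at index 0.
    static boolean strictPositive(String s) {
        String filter = "FILTER_STRING";
        return s.indexOf(filter) > 0;
    }

    // A variant that matches the filter string at any position.
    static boolean containsMatch(String s) {
        return s.contains("FILTER_STRING");
    }

    public static void main(String[] args) {
        System.out.println(strictPositive("FILTER_STRING at line start")); // false
        System.out.println(containsMatch("FILTER_STRING at line start"));  // true
        System.out.println(strictPositive("prefix FILTER_STRING"));       // true
    }
}
```

This does not change the scaling behaviour discussed here, but it does change which lines survive the filter.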

The scaling problem shows itself when I scale up the number of machines in the cluster from 4 to 8. Here are some details about the environment:

  • Each executor is configured to use 6 GB of memory. Also the HDFS is co-hosted on the same machines.
  • Each machine has 24 GB of RAM in total and 12 cores (configured to use 8 for Spark executors).
  • Spark is hosted in a YARN cluster.
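For reference, a submission matching the sizing above might look like the sketch below. This is an assumption about how the job is launched, not the asker's actual command; the class name, jar, and input path are placeholders:

```shell
# Hypothetical spark-submit invocation for the sizing described above:
# 6 GB per executor, 8 cores per machine for Spark, running on YARN.
# --num-executors 8 assumes one executor per machine on the 8-node cluster.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 6g \
  --executor-cores 8 \
  --num-executors 8 \
  --class com.example.FilterJob \
  filter-job.jar hdfs:///data/input.txt
```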

Any ideas why I am not getting the degree of scalability I expect from Spark?

asked Mar 13 '16 by asaad
1 Answer

Thanks to a lot of helpful comments, I think I found what was wrong with my cluster. The idea that the HDFS replication factor was at least part of the problem was a very good clue.

To test this, I changed the HDFS replication factor to the number of cluster nodes and re-ran the tests, and I got scalable results. But I was not convinced this was the whole explanation, because Spark claims to take data locality into account when assigning partitions to executors, and even with the default replication level (3) it should have enough room to assign partitions evenly. With some more investigation I figured out that this may not be the case if YARN (or any other cluster manager) decides to place more than one executor on the same physical machine rather than using all the machines. In that case some HDFS blocks are not local to any executor, which results in data transfer across the network and the scaling problem I observed.
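For anyone wanting to reproduce the test, the replication factor of an existing HDFS path can be raised with the standard `hdfs dfs -setrep` command; the path below is a placeholder:

```shell
# Set the replication factor of the input data to the number of
# cluster nodes (8 here). -w blocks until re-replication completes,
# so the timing test is not run while blocks are still being copied.
hdfs dfs -setrep -w 8 /data/input.txt
```

Note this only affects the test data; the cluster-wide default (`dfs.replication`) stays at 3 unless changed in hdfs-site.xml.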

answered Sep 22 '22 by asaad