
Spark local vs HDFS performance

I have a Spark cluster and HDFS running on the same machines. I've copied a single text file, about 3 GB, to each machine's local filesystem and to the HDFS distributed filesystem.

I have a simple word count PySpark program.

If I submit the program reading the file from the local filesystem, it takes about 33 sec. If I submit the program reading the file from HDFS, it takes about 46 sec.

Why? I expected exactly the opposite result.

Added after sgvd's request:

16 slaves 1 master

Spark Standalone with no particular settings (replication factor 3)

Version 1.5.2

import sys
sys.path.insert(0, '/usr/local/spark/python/')
sys.path.insert(0, '/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip')
import os
os.environ['SPARK_HOME']='/usr/local/spark'
os.environ['JAVA_HOME']='/usr/local/java'
from pyspark import SparkContext
#conf = pyspark.SparkConf().set<conf settings>


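if sys.argv[1] == 'local':
    # Read the copy that exists on every machine's local filesystem
    print('Running in local file mode')
    sc = SparkContext('spark://192.168.2.11:7077', 'Test Local file')
    rdd = sc.textFile('/root/test2')
else:
    # Read the copy stored in HDFS (replication factor 3)
    print('Running in HDFS mode')
    sc = SparkContext('spark://192.168.2.11:7077', 'Test HDFS file')
    rdd = sc.textFile('hdfs://192.168.2.11:9000/data/test2')


# Word count: split lines on spaces, emit (word, 1), sum the counts per word
rdd1 = rdd.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
# Five most frequent words, ordered by descending count
topFive = rdd1.takeOrdered(5, key=lambda x: -x[1])
print(topFive)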
arj asked Oct 31 '22 11:10


1 Answer

It is a bit counterintuitive, but since the replication factor is 3 and you have 16 nodes, each node stores on average about 20% of the data locally in HDFS (3/16 ≈ 19%). So roughly 6 worker nodes should be sufficient, on average, to read the entire file without any network transfer.

If you record the running time against the number of worker nodes, you should notice that beyond around 6 nodes there is no difference between reading from the local FS and from HDFS.

The same computation can be done with variables, e.g. x = number of worker nodes and y = replication factor. Reading from the local FS requires the file to be on every node, which amounts to y = x, while with HDFS there should be no difference once about x/y nodes are used (16/3 ≈ 5.3, hence around 6). This is exactly what you are observing, and it seems counterintuitive at first. Would you use a replication factor of 100% in production?
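To make the arithmetic concrete, here is a minimal sketch in plain Python (no Spark needed). The function names `local_fraction` and `nodes_to_cover` are made up for illustration. It assumes HDFS blocks are spread evenly across the nodes, and it ignores the actual block placement policy and Spark's locality-aware scheduling, so treat the numbers as a rough estimate only.

```python
import math

def local_fraction(num_nodes, replication):
    """Average share of the file that a single node holds locally,
    assuming replicas are spread evenly (an idealization)."""
    return min(1.0, float(replication) / num_nodes)

def nodes_to_cover(num_nodes, replication):
    """Rough number of workers whose local data adds up to one full copy."""
    return int(math.ceil(float(num_nodes) / replication))

# HDFS case from the question: 16 nodes, replication factor 3
print(local_fraction(16, 3))   # 0.1875
print(nodes_to_cover(16, 3))   # 6

# Local-FS case: the file is on every node, i.e. replication == node count
print(local_fraction(16, 16))  # 1.0
print(nodes_to_cover(16, 16))  # 1
```

With the file copied to every node, each worker already has all of the data, which is why the local-FS run behaves like HDFS with a replication factor equal to the cluster size.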

Radu Ionescu answered Nov 15 '22 08:11