I have a Spark cluster and HDFS running on the same machines. I've copied a single text file, about 3 GB, to each machine's local filesystem and to the HDFS distributed filesystem.
I have a simple word count PySpark program.
If I submit the program reading the file from the local filesystem, it takes about 33 seconds. If I submit the program reading the file from HDFS, it takes about 46 seconds.
Why? I expected exactly the opposite result.
Added after sgvd's request:
16 slaves 1 master
Spark Standalone with no particular settings (replication factor 3)
Version 1.5.2
import sys
sys.path.insert(0, '/usr/local/spark/python/')
sys.path.insert(0, '/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip')
import os
os.environ['SPARK_HOME'] = '/usr/local/spark'
os.environ['JAVA_HOME'] = '/usr/local/java'
from pyspark import SparkContext
#conf = pyspark.SparkConf().set<conf settings>
if sys.argv[1] == 'local':
    print('Running in local file mode')
    sc = SparkContext('spark://192.168.2.11:7077', 'Test Local file')
    rdd = sc.textFile('/root/test2')
else:
    print('Running in HDFS mode')
    sc = SparkContext('spark://192.168.2.11:7077', 'Test HDFS file')
    rdd = sc.textFile('hdfs://192.168.2.11:9000/data/test2')
# Word count: split each line on spaces, emit (word, 1), then sum the counts per word
rdd1 = rdd.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
# Take the five most frequent words (sort by descending count)
topFive = rdd1.takeOrdered(5, key=lambda x: -x[1])
print(topFive)
It is a bit counterintuitive, but since the replication factor is 3 and you have 16 nodes, each node stores on average about 20% (3/16) of the data locally in HDFS. About 6 worker nodes should then be sufficient, on average, to read the entire file without any network transfer.
If you record the running time against the number of worker nodes, you should notice that beyond roughly 6 nodes there is no difference between reading from the local FS and reading from HDFS.
The above computation can be done using variables, e.g. x = number of worker nodes and y = replication factor. Since reading from the local FS requires the file to be on all nodes, you effectively end up with x = y, and there will be no difference after floor(x/y) nodes are used. This is exactly what you are observing, and it seems counterintuitive at first. Would you use a replication factor of 100% in production?
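As a rough illustration of the arithmetic above, here is a minimal sketch in plain Python (no Spark required). The function name `hdfs_locality` is hypothetical, and it assumes HDFS spreads block replicas evenly across all nodes, which a real cluster only approximates. It rounds up, which gives the "about 6" figure quoted earlier.

```python
import math

def hdfs_locality(num_nodes, replication):
    """Idealized estimate, assuming replicas are spread evenly over the nodes.

    Returns a tuple:
      - average fraction of the file stored locally on each node
      - number of nodes that together hold roughly one full copy (rounded up)
    """
    fraction_per_node = replication / float(num_nodes)
    nodes_for_full_copy = int(math.ceil(num_nodes / float(replication)))
    return fraction_per_node, nodes_for_full_copy

# The cluster from the question: 16 nodes, replication factor 3.
fraction, nodes = hdfs_locality(16, 3)
print("average fraction per node: %.4f" % fraction)
print("nodes holding one full copy: %d" % nodes)

# Copying the file to every node's local FS behaves like replication == num_nodes.
local_fraction, local_nodes = hdfs_locality(16, 16)
print("local FS case: fraction %.1f, nodes %d" % (local_fraction, local_nodes))
```

With the file copied to every machine, each node already holds the whole file, which is the x = y case described above.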