This may be a silly question. I want to make sure I understand this correctly.
When you read a huge file (400GB) into a cluster where the collective executor memory is only around 120GB, Spark seems to read forever. It doesn't crash, nor does it start the first map job.
What I think is happening is that Spark reads through the large file as a stream and starts discarding the older lines when the executors run out of memory. This can obviously become a problem when execution of the .map code starts, as the executor JVM would then have to read the file back from the beginning. I am wondering, though, whether Spark somehow spills the data onto the hard drive, similar to the shuffle spilling mechanism.
Note that I am not referring to the caching process here. This has to do with the initial read using
sc.textFile(filename)
sc.textFile
doesn't commence any reading. It simply defines a driver-resident data structure (an RDD) which can be used for further processing.
It is not until an action is called on an RDD that Spark will build up a strategy to perform all the required transforms (including the read) and then return the result.
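To make the laziness concrete, here is a minimal sketch using a plain Scala `Iterator` (no Spark required) as an analogy for the same pattern: building the pipeline performs no reads, and only the terminal operation triggers them. The `readLines` helper and its counter are hypothetical, added purely so the deferred read is observable.

```scala
// Analogy for Spark's lazy evaluation using a plain Scala Iterator.
// Building the pipeline does no work; the terminal operation triggers it.
var readsPerformed = 0

// Hypothetical stand-in for reading a file line by line.
def readLines(): Iterator[String] = Iterator.tabulate(4) { i =>
  readsPerformed += 1          // side effect so we can observe when reading happens
  s"line $i"
}

val lines   = readLines()          // analogous to sc.textFile: nothing is read yet
val lengths = lines.map(_.length)  // analogous to .map: still lazy
// readsPerformed is still 0 at this point
val total   = lengths.sum          // "action": only now are the lines produced
```

In real Spark the same principle applies, except the pipeline is distributed across executors and the read is partitioned rather than a single stream.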
If an action is called to run the sequence, and your next transformation after the read is a map, then Spark will read a small section of lines from the file (according to the partitioning strategy, based on the number of cores) and immediately start mapping it, until it needs to return a result to the driver or shuffle before the next sequence of transformations.
If your partitioning strategy (defaultMinPartitions
) seems to be swamping the workers because the Java representation of your partition (an InputSplit
in HDFS terms) is bigger than the available executor memory, then you need to specify the number of partitions to read as the second parameter to textFile
. You can calculate the ideal number of partitions by dividing your file size by your target partition size (allowing for memory growth). A simple check that the file can be read would be:
sc.textFile(file, numPartitions)
.count()
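As a sketch of that calculation, assuming the 400GB file from the question and a hypothetical target partition size of 128MB (a common HDFS block size), the arithmetic would be:

```scala
// Hypothetical sizes: a 400 GiB file, 128 MiB target partitions.
val fileSizeBytes        = 400L * 1024 * 1024 * 1024
val targetPartitionBytes = 128L * 1024 * 1024

// Round up so a trailing partial partition is still counted.
val numPartitions =
  ((fileSizeBytes + targetPartitionBytes - 1) / targetPartitionBytes).toInt
```

The resulting count is what you would pass as the second argument to textFile; in practice you might scale the target size down further to allow for the memory growth mentioned above.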
Also, check this question: run reduceByKey on huge data in spark