
Spark reading large file

This may be a silly question. I want to make sure I understand this correctly.

When you read a huge file (400 GB) into a cluster where the collective executor memory is only around 120 GB, Spark seems to read forever. It doesn't crash, nor does it start the first map job.

What I think is happening is that Spark is reading through the large file as a stream and starts discarding the older lines when the executors run out of memory. This can obviously be a problem when execution of the .map code starts, as the executor JVM would then have to read the file back from the beginning. I am wondering, though, whether Spark is somehow spilling the data onto the hard drive, similar to the shuffle spilling mechanism.

Note that I am not referring to the cache process. This has to do with the initial read using sc.textFile(filename).

asked Jun 29 '15 by bhomass


1 Answer

sc.textFile doesn't commence any reading. It simply defines a driver-resident data structure which can be used for further processing.

It is not until an action is called on an RDD that Spark will build up a strategy to perform all the required transforms (including the read) and then return the result.
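As a minimal sketch of that laziness (the file path and app name here are made up for illustration; in a real job the SparkContext and master are usually supplied by spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup; in a real job the SparkContext typically already exists.
val sc = new SparkContext(new SparkConf().setAppName("lazy-read-demo"))

// Returns immediately: nothing is read yet, only an RDD and its lineage are defined.
val lines = sc.textFile("hdfs:///data/huge-file.txt")

// Transformations are lazy too; still no I/O here.
val lengths = lines.map(_.length)

// Only this action triggers the actual read (and the map), partition by partition.
val total = lengths.count()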

When an action runs the sequence and your next transformation after the read is a map, Spark reads a small section of the file (one partition, according to the partitioning strategy based on the number of cores), immediately maps it, and continues until it needs to return a result to the driver or shuffle before the next sequence of transformations.

If your partitioning strategy (defaultMinPartitions) seems to be swamping the workers because the Java representation of a partition (an InputSplit in HDFS terms) is bigger than the available executor memory, then specify the number of partitions to read as the second parameter to textFile. You can calculate the ideal number of partitions by dividing the file size by your target partition size (allowing for memory growth). A simple check that the file can be read would be:

sc.textFile(file, numPartitions)
  .count()  
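For example, a rough back-of-the-envelope sizing sketch (the 128 MB target per partition and the file path are illustrative assumptions, not values fixed by Spark):

// Aim for partitions small enough that their in-memory representation fits comfortably in an executor.
val fileSizeBytes = 400L * 1024 * 1024 * 1024      // ~400 GB input
val targetPartitionBytes = 128L * 1024 * 1024      // ~128 MB per partition, leaving headroom for object overhead
val numPartitions = math.ceil(fileSizeBytes.toDouble / targetPartitionBytes).toInt  // 3200 in this case

sc.textFile("hdfs:///data/huge-file.txt", numPartitions).count()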

Also, check this question: run reduceByKey on huge data in spark

answered Oct 09 '22 by Alister Lee