
Does Apache Spark read and process at the same time, or does it first read the entire file into memory and then start transformations?

I am curious whether Spark first reads the entire file into memory and only then starts processing it (i.e. applying transformations and actions), or whether it reads the first chunk of a file, applies transformations to it, reads the second chunk, and so on.

Is there any difference between Spark and Hadoop in this regard? I have read that Spark keeps the entire file in memory most of the time, while Hadoop does not. But what about the initial step, when we read the file for the first time and map the keys?

Thanks

YohanRoth asked Oct 18 '22 at 20:10


1 Answer

I think a fair characterisation would be this:

Both Hadoop (or more accurately MapReduce) and Spark use the same underlying filesystem, HDFS, to begin with.

During the Mapping phase both will read all the data and actually write the map result to disk so that it can be sorted and distributed between nodes via the Shuffle logic. Both of them do in fact try to cache the just-mapped data in memory, in addition to spilling it to disk, for the Shuffle to do its work. The difference here, though, is that Spark is a lot more efficient in this process, trying to optimally align the node chosen for a specific computation with the data already cached on a certain node. Since Spark also does something called lazy evaluation, its memory use ends up very different from Hadoop's, because it plans computation and caching simultaneously.
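
To make the lazy-evaluation point concrete, here is a minimal sketch (assuming a local SparkContext and a hypothetical HDFS path) showing that transformations only record a plan, and reading only happens, partition by partition, once an action runs:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-read").setMaster("local[*]"))

    // Nothing is read here: textFile only records *how* to read the file
    // (by default one partition per HDFS block / input split).
    val lines  = sc.textFile("hdfs:///data/big.txt")  // hypothetical path
    val mapped = lines.map(_.toUpperCase)             // still nothing read

    // Only this action triggers execution; each task streams through its own
    // partition record by record, so the whole file never has to sit in memory at once.
    println(mapped.count())

    sc.stop()
  }
}
```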

In a word-count job, Hadoop does this (a sketch of the corresponding mapper and reducer follows the list):

  1. Map all the words to 1.
  2. Write all those mapped pairs of (word, 1) to a single file in HDFS (single file could still span multiple nodes on the distributed HDFS) (this is the shuffle phase)
  3. Sort the rows of (word, 1) in that shared file (this is the sorting phase)
  4. Have the reducers read sections (partitions) from that shared file that now contains all the words sorted and sum up all those 1s for every word.
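
As a rough illustration of those four steps, here is what the classic Hadoop WordCount mapper and reducer look like, written in Scala against the Hadoop MapReduce API so all snippets stay in one language (the Job/driver setup is omitted, and the class names are just illustrative):

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

// Step 1: map every word to a (word, 1) pair. The framework writes these pairs
// out and shuffles/sorts them by key (steps 2 and 3 above).
class WordMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      ctx.write(word, one)
    }
}

// Step 4: each reducer reads one sorted partition of keys and sums up the 1s.
class WordReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it  = values.iterator()
    while (it.hasNext) sum += it.next().get()
    ctx.write(key, new IntWritable(sum))
  }
}
```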

Spark, on the other hand, will go the other way around (see the sketch after this list):

  1. It figures that, as in Hadoop, it is probably most efficient to have all those words summed up via separate reducer runs, so it decides, according to some factors, that it wants to split the job into x parts and then merge them into the final result.
  2. So it knows that the words will have to be sorted, which will require at least part of them to be in memory at a given time.
  3. After that it evaluates that such a sorted list will require all words to be mapped to (word, 1) pairs before the calculation can start.
  4. It then works through steps 3, then 2, then 1.
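
In code, the same word count in Spark is just a chain of lazy transformations (again assuming an existing `sc` and hypothetical paths); the whole chain is planned before a single record is read, and you can inspect that plan via the RDD's lineage:

```scala
// The three transformations below only build a plan; nothing is read yet.
val counts = sc.textFile("hdfs:///data/big.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)          // introduces the shuffle boundary

// Print the lineage Spark has planned (stages are split at the shuffle).
println(counts.toDebugString)

// Only now does Spark execute the plan, stage by stage, partition by partition.
counts.saveAsTextFile("hdfs:///out/wordcount")   // hypothetical output path
```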

Now the trick relative to Hadoop is that Spark knows in step 3 which in-memory cached items it will need in step 2, and in step 2 it already knows how these parts (mostly K-V pairs) will be needed in the final step 1. This allows Spark to plan the execution of jobs very efficiently by caching data it knows will be needed in later stages of the job. Hadoop, working from the beginning (mapping) to the end without explicitly looking ahead into the following stages, simply cannot use memory this efficiently and hence doesn't waste resources keeping in memory the large chunks that Spark would keep. Unlike Spark, it just doesn't know whether all the pairs produced in a map phase will be needed in the next step.

The fact that Spark appears to keep the whole dataset in memory hence isn't something Spark actively does, but rather a result of the way Spark is able to plan the execution of a job. On the other hand, Spark may actually be able to keep fewer things in memory in a different kind of job. Counting the number of distinct words is a good example here in my opinion. Here Spark would have planned ahead and would immediately drop a repeated word from the cache/memory when encountering it during the mapping, while Hadoop would go ahead and waste memory on shuffling the repeated words too (I acknowledge there are a million ways to make Hadoop do this as well, but it's not out of the box; there are also ways of writing your Spark job badly enough to break these optimisations, but it's not so easy to fool Spark here :)).
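
A hedged sketch of that distinct-word case (assuming an existing `sc` and a hypothetical path): `distinct()` is built on a shuffle with a map-side combiner, so each executor drops repeated words from its own partitions before anything is sent over the network, which is the planned-ahead behaviour described above:

```scala
// Counting distinct words: duplicates are eliminated per partition (map side)
// before the shuffle, so repeated words don't waste memory or network traffic.
val distinctWords = sc.textFile("hdfs:///data/big.txt")
  .flatMap(_.split("\\s+"))
  .distinct()

println(distinctWords.count())
```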

Hope this helps you understand that the memory use is just a natural consequence of the way Spark works, not something actively aimed at and also not something strictly required by Spark. It is also perfectly capable of repeatedly spilling data back to disk between steps of the execution when memory becomes an issue.
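
And if you do keep an intermediate result around yourself, you pick the storage level, so spilling to disk is explicitly allowed for; a minimal sketch, under the same assumptions as the snippets above:

```scala
import org.apache.spark.storage.StorageLevel

// Caching is opt-in, and the storage level decides what happens under memory
// pressure: MEMORY_AND_DISK lets Spark spill partitions that don't fit in memory
// to disk instead of dropping and recomputing them later.
val pairs = sc.textFile("hdfs:///data/big.txt")   // hypothetical path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .persist(StorageLevel.MEMORY_AND_DISK)

println(pairs.count())                        // first action materialises the cache
println(pairs.reduceByKey(_ + _).count())     // later actions reuse the cached partitions
```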

For more insight into this I recommend learning about the DAG scheduler in Spark from here to see how this is actually done in code. You'll see that it always follows the pattern of working out where data is (and will be) cached before figuring out what to calculate where.

Armin Braun answered Oct 21 '22 at 08:10