
Does spark automatically cache some results?

I ran an action twice, and the second run took very little time, so I suspect that Spark automatically caches some results. But I couldn't find any source confirming this.

I'm using Spark 1.4.

import re

doc = sc.textFile('...')
doc_wc = doc.flatMap(lambda x: re.split(r'\W', x)) \
            .filter(lambda x: x != '') \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda x, y: x + y)

%%time
doc_wc.take(5) # first time
# CPU times: user 10.7 ms, sys: 425 µs, total: 11.1 ms
# Wall time: 4.39 s

%%time
doc_wc.take(5) # second time
# CPU times: user 6.13 ms, sys: 276 µs, total: 6.41 ms
# Wall time: 151 ms
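The transformation chain above can be sanity-checked in plain Python without Spark. The sample lines below are a hypothetical stand-in for the text file's contents:

```python
import re
from collections import Counter

# Hypothetical stand-in for the text file's lines
lines = ["to be or not to be", "that is the question"]

# Mirror the Spark pipeline: split on non-word chars, drop empties, count
words = [w for line in lines for w in re.split(r'\W', line) if w != '']
doc_wc = Counter(words)  # plays the role of reduceByKey(lambda x, y: x + y)

print(doc_wc.most_common(2))  # [('to', 2), ('be', 2)]
```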
asked Dec 25 '22 by yalei du

1 Answer

From the documentation:

Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.

The operating system's filesystem cache will also speed up repeated reads from disk.
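The effect of any such caching layer on repeated runs can be illustrated with a plain-Python memoization sketch. This is only an analogy (Spark's shuffle-file reuse is not memoization), but the observable speedup on the second call is similar:

```python
import functools
import time

calls = {'n': 0}  # tracks how many times the real work actually runs

@functools.lru_cache(maxsize=None)
def expensive_count(key):
    # Stands in for a recomputation-heavy job
    calls['n'] += 1
    time.sleep(0.01)  # simulate work
    return len(key)

expensive_count('shuffle')  # first call: does the work
expensive_count('shuffle')  # second call: served from cache
print(calls['n'])  # 1 -- the work ran only once
```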

answered Jan 18 '23 by dpeacock