I ran an action twice, and the second run took very little time, so I suspect that Spark automatically caches some results. But I couldn't find any source confirming this.
I'm using Spark 1.4.
import re

doc = sc.textFile('...')
doc_wc = doc.flatMap(lambda x: re.split(r'\W', x)) \
            .filter(lambda x: x != '') \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda x, y: x + y)
%%time
doc_wc.take(5) # first time
# CPU times: user 10.7 ms, sys: 425 µs, total: 11.1 ms
# Wall time: 4.39 s
%%time
doc_wc.take(5) # second time
# CPU times: user 6.13 ms, sys: 276 µs, total: 6.41 ms
# Wall time: 151 ms
From the documentation:
Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.
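So if you plan to reuse doc_wc, make the caching explicit rather than relying on that implicit shuffle-file reuse. A minimal sketch, assuming the sc and doc_wc from the question:

doc_wc.cache()      # or .persist() to choose an explicit StorageLevel
doc_wc.count()      # a full action materializes and caches every partition
doc_wc.take(5)      # later actions read from the cached partitions
doc_wc.unpersist()  # release the cached blocks when finished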
On top of that, the operating system's filesystem cache will speed up repeated reads of the same data from disk.
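You can confirm that the speedup is not coming from RDD caching itself: unless you call persist, the RDD stays uncached, and the second take(5) is fast only because the shuffle output (and the OS cache) is reused. A quick check, assuming the doc_wc from the question:

print(doc_wc.is_cached)          # False: persist/cache was never called
print(doc_wc.getStorageLevel())  # StorageLevel(False, False, False, False, 1), i.e. none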