
Are Spark DataFrames ever implicitly cached?

I have recently understood that Spark DAGs get executed lazily, and intermediate results are never cached unless you explicitly call DF.cache().

Now I've run an experiment that, based on that fact, should give me different random numbers every time:

from pyspark.sql.functions import rand

df = spark.range(0, 3)
df = df.select("id", rand().alias('rand'))

df.show()

Executing these lines multiple times gives me different random numbers each time, as expected. But if the computed values (rand() in this case) are never stored, then calling just df.show() repeatedly should give me new random numbers every time, because the 'rand' column is not cached, right?

df.show()

Called a second time, though, this command gives me the same random numbers as before. So the values seem to be stored somewhere, which I thought did not happen.

Where is my thinking wrong? And could you give me a minimal example of non-caching that results in new random numbers every time?

Alexander Engelhardt asked Sep 03 '25 08:09


1 Answer

The random seed parameter of rand() is set when rand().alias('rand') is called inside the select method and does not change afterwards. Therefore, calling show multiple times always uses the same random seed, and hence the result is the same.

You can see it more clearly when you return the result of rand().alias('rand') by itself, which also shows the random seed parameter:

>>> rand().alias('rand')
Column<b'rand(166937772096155366) AS `rand`'>

When providing the seed directly, it will show up accordingly:

>>> rand(seed=22).alias('rand') 
Column<b'rand(22) AS `rand`'>

The random seed is set when rand() is called and is stored as part of the column expression passed to select. Each df.show() re-evaluates that expression with the same seed, so the result is the same even though nothing was cached. You will get different results when you re-evaluate rand() every time, as in df.select("id", rand().alias('rand')).show().
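To make the mechanism concrete without needing a Spark session, here is a minimal pure-Python sketch of the same idea. RandExpr and evaluate are hypothetical names invented for this illustration; they are not part of PySpark and only model the behaviour described above, under the assumption that the seed is drawn once when the expression is built and that each evaluation recomputes everything from that seed.

```python
import random


class RandExpr:
    """Toy stand-in for a rand() column expression (hypothetical, not a PySpark class)."""

    def __init__(self, seed=None):
        # The seed is chosen once, when the expression is built.
        self.seed = random.getrandbits(63) if seed is None else seed

    def evaluate(self, n_rows):
        # Every evaluation recomputes from scratch; nothing is cached.
        rng = random.Random(self.seed)
        return [rng.random() for _ in range(n_rows)]


expr = RandExpr()               # seed fixed here, like rand() inside select()
first = expr.evaluate(3)        # plays the role of the first df.show()
second = expr.evaluate(3)       # plays the role of the second df.show()
print(first == second)          # compares two independent recomputations

fresh = RandExpr().evaluate(3)  # a new expression draws a new seed
print(first == fresh)           # compares against the freshly seeded run
```

The two evaluate calls on expr agree because they replay the same seed, not because a result was stored. Building a new RandExpr each time corresponds to calling rand() again inside a new select.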

pansen answered Sep 05 '25 00:09