Spark's cache(), when used together with repartition(), doesn't seem to cache the DataFrame. Can anyone explain why this happens?
Edit:
df.repartition(1000).cache()
df.count()
I have tried doing them on separate lines and that works.
Edit:
df2 = df1.repartition(1000)
df2.cache()
df2.count()
I expected the DataFrame to be cached, but I can't see it under the Storage tab in the Spark UI.
DataFrames are immutable, like RDDs. So although you are calling repartition on df, you are not assigning the result to any DataFrame, and the current df will not change.
df.repartition(1000).cache()
df.count()
The above won't work: repartition returns a new DataFrame, cache() is applied to that new DataFrame, and df.count() then evaluates the original df, so the repartitioned data is never materialized or cached.
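A minimal sketch of the immutability point (assuming an existing SparkSession named spark; demo and the partition counts shown are illustrative):

// Hypothetical example data, just for demonstration
val demo = spark.range(100).toDF("id")
println(demo.rdd.getNumPartitions)   // some default, e.g. 8
demo.repartition(1000)               // returns a NEW DataFrame; the result is discarded
println(demo.rdd.getNumPartitions)   // unchanged, e.g. still 8 -- demo was never modified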
df.repartition(1000)
df.cache()
df.count()
For the above code, if you check the Storage tab, it won't show 1000 cached partitions. It will show the cached DataFrame with df.rdd.getNumPartitions partitions (which will not be 1000), because df.repartition(1000) on its own line returns a new DataFrame that is immediately discarded; df.cache() then caches the original, un-repartitioned df.
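One way to confirm which DataFrame actually got cached, without opening the UI, is to inspect storageLevel and the partition count (a sketch, reusing the assumed df):

df.repartition(1000)                 // discarded; has no effect on df
df.cache()                           // marks the ORIGINAL df for caching
df.count()                           // materializes the cache
println(df.storageLevel)             // non-NONE level, e.g. MEMORY_AND_DISK, the Dataset.cache() default
println(df.rdd.getNumPartitions)     // the original partition count, not 1000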
So try this instead:
val df1 = df.repartition(1000).cache()
df1.count()
This should work: df1 refers to the repartitioned DataFrame, cache() marks that same DataFrame, and count() materializes it, so the Storage tab will show 1000 cached partitions.
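And a quick sanity check on the fixed version (same assumptions as above):

val df1 = df.repartition(1000).cache()
df1.count()                          // triggers the job and fills the cache
println(df1.rdd.getNumPartitions)    // 1000
println(df1.storageLevel.useMemory)  // true -- df1 is marked for in-memory caching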