
Cache not preventing multiple filescans?

I have a question regarding the DataFrame API's cache. Consider the following query:

val dfA = spark.table(tablename)
  .cache

val dfC = dfA
  .join(dfA.groupBy($"day").count, Seq("day"), "left")

Since dfA is used twice in this query, I thought caching it would be beneficial. But I'm confused by the plan: the table is still scanned twice (FileScan appears twice):

dfC.explain

== Physical Plan ==
*Project [day#8232, i#8233, count#8251L]
+- SortMergeJoin [day#8232], [day#8255], LeftOuter
   :- *Sort [day#8232 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(day#8232, 200)
   :     +- InMemoryTableScan [day#8232, i#8233]
   :           +- InMemoryRelation [day#8232, i#8233], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
   :                 +- *FileScan parquet mytable[day#8232,i#8233] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://tablelocation], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<day:int,i:int>
   +- *Sort [day#8255 ASC NULLS FIRST], false, 0
      +- *HashAggregate(keys=[day#8255], functions=[count(1)])
         +- Exchange hashpartitioning(day#8255, 200)
            +- *HashAggregate(keys=[day#8255], functions=[partial_count(1)])
               +- InMemoryTableScan [day#8255]
                     +- InMemoryRelation [day#8255, i#8256], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
                           +- *FileScan parquet mytable[day#8232,i#8233] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://tablelocation], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<day:int,i:int>

Why isn't the table read from the cache? I'm using Spark 2.1.1.

asked Dec 30 '25 20:12 by Raphael Roth


1 Answer

Try calling count() right after cache, so you trigger one action and the caching is done before the plan of the second action is computed.

As far as I know, the first action will trigger the cache, but since Spark planning is not dynamic, if your first action after cache uses the table twice, it will have to read it twice (because the table isn't cached until that action has executed).
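The suggestion above can be sketched as follows. This is a minimal, self-contained example: a toy DataFrame built from a Seq stands in for spark.table(tablename), and the local[*] master and app name are assumptions for running locally.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("cache-demo")
  .getOrCreate()
import spark.implicits._

// Toy stand-in for spark.table(tablename)
val dfA = Seq((1, 10), (1, 11), (2, 20)).toDF("day", "i").cache()

// Materialize the cache with a separate action first, so the
// InMemoryRelation is populated before the join below runs.
// Without this, the first action has to compute both join branches
// from the (not-yet-cached) source.
dfA.count()

val dfC = dfA.join(dfA.groupBy($"day").count, Seq("day"), "left")
dfC.explain() // both branches should now read from the cache
```

The key point is that cache is lazy: it only marks the plan for caching, and the data is stored the first time an action actually computes it.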

If the above doesn't work (and/or you are hitting the bug mentioned), it's probably related to the plan; you can also try transforming the DataFrame to an RDD and then back to a DataFrame (this way the plan is fixed exactly).
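The RDD round-trip can be sketched like this. Again a toy DataFrame stands in for the real table, and the local master is an assumption; the idea is that the rebuilt DataFrame's lineage starts at the RDD, so the optimizer can no longer expand it back into the original file scan.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("rdd-roundtrip-demo")
  .getOrCreate()
import spark.implicits._

val dfA = Seq((1, 10), (2, 20)).toDF("day", "i")

// Round-trip through an RDD: the new DataFrame's plan begins at the
// existing RDD rather than at the source table, cutting the lineage.
val dfFixed = spark.createDataFrame(dfA.rdd, dfA.schema)
dfFixed.explain() // the plan should show a scan of the existing RDD
```

The trade-off is that the round-trip loses optimizer information (e.g. pushdown into the source), so it is a workaround rather than a general pattern.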

answered Jan 01 '26 20:01 by BiS