
Cache not preventing multiple filescans?

I have a question regarding the DataFrame API's cache. Consider the following query:

val dfA = spark.table(tablename)
  .cache

val dfC = dfA
  .join(dfA.groupBy($"day").count, Seq("day"), "left")

Since dfA is used twice in this query, I thought caching it would be beneficial. But I'm confused by the plan: the table is still scanned twice (FileScan appears twice):

dfC.explain

== Physical Plan ==
*Project [day#8232, i#8233, count#8251L]
+- SortMergeJoin [day#8232], [day#8255], LeftOuter
   :- *Sort [day#8232 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(day#8232, 200)
   :     +- InMemoryTableScan [day#8232, i#8233]
   :           +- InMemoryRelation [day#8232, i#8233], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
   :                 +- *FileScan parquet mytable[day#8232,i#8233] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://tablelocation], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<day:int,i:int>
   +- *Sort [day#8255 ASC NULLS FIRST], false, 0
      +- *HashAggregate(keys=[day#8255], functions=[count(1)])
         +- Exchange hashpartitioning(day#8255, 200)
            +- *HashAggregate(keys=[day#8255], functions=[partial_count(1)])
               +- InMemoryTableScan [day#8255]
                     +- InMemoryRelation [day#8255, i#8256], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
                           +- *FileScan parquet mytable[day#8232,i#8233] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://tablelocation], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<day:int,i:int>

Why isn't the table read from the cache? I'm using Spark 2.1.1.

asked Dec 30 '25 20:12 by Raphael Roth


1 Answer

Try calling count() right after cache, so you trigger one action and the caching is done before the plan of the second action is computed.

As far as I know, the first action will trigger the cache, but since Spark planning is not dynamic, if your first action after cache uses the table twice, it will have to read it twice (because the table isn't cached until that action has executed).
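The suggestion above can be sketched as follows. This is a minimal, self-contained example: a toy DataFrame built from a Seq stands in for spark.table(tablename), and the local[*] master and app name are assumptions for running locally.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("cache-demo")
  .getOrCreate()
import spark.implicits._

// Toy stand-in for spark.table(tablename)
val dfA = Seq((1, 10), (1, 11), (2, 20)).toDF("day", "i").cache()

// Materialize the cache with a separate action first, so the
// InMemoryRelation is populated before the join below runs.
// Without this, the first action has to compute both join branches
// from the (not-yet-cached) source.
dfA.count()

val dfC = dfA.join(dfA.groupBy($"day").count, Seq("day"), "left")
dfC.explain() // both branches should now read from the cache
```

The key point is that cache is lazy: it only marks the plan for caching, and the data is stored the first time an action actually computes it.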

If the above doesn't work (and/or you are hitting the bug mentioned), it's probably related to the plan; you can also try transforming the DataFrame to an RDD and then back to a DataFrame (this way the plan is fixed exactly).
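The RDD round-trip can be sketched like this. Again a toy DataFrame stands in for the real table, and the local master is an assumption; the idea is that the rebuilt DataFrame's lineage starts at the RDD, so the optimizer can no longer expand it back into the original file scan.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("rdd-roundtrip-demo")
  .getOrCreate()
import spark.implicits._

val dfA = Seq((1, 10), (2, 20)).toDF("day", "i")

// Round-trip through an RDD: the new DataFrame's plan begins at the
// existing RDD rather than at the source table, cutting the lineage.
val dfFixed = spark.createDataFrame(dfA.rdd, dfA.schema)
dfFixed.explain() // the plan should show a scan of the existing RDD
```

The trade-off is that the round-trip loses optimizer information (e.g. pushdown into the source), so it is a workaround rather than a general pattern.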

answered Jan 01 '26 20:01 by BiS