I'm trying to persist a temp view so I can query it again via SQL:

val df = spark.sqlContext.read.option("header", true).csv("xxx.csv")
df.createOrReplaceTempView("xxx")

Then I persist/cache it with one of:

df.cache() // or
spark.sqlContext.cacheTable("xxx") // or
df.persist(MEMORY_AND_DISK) // or
spark.sql("CACHE TABLE xxx")

Then I move the underlying xxx.csv and run:

spark.sql("select * from xxx")

Upon which, I find that only CACHE TABLE xxx stores a copy. What am I doing wrong? How can I persist (e.g. DISK_ONLY) a queryable view/table?
When you persist a dataset, each node stores its partitions of the data in memory and reuses them in other actions on that dataset. Spark's persisted data on nodes is fault-tolerant: if any partition of a Dataset is lost, it is automatically recomputed using the original transformations that created it.
// Storing data in memory:
val dataframePersist = dataframe.persist(StorageLevel.MEMORY_ONLY)
dataframePersist.show(false)

The persist() function stores the data in memory. unpersist() marks the DataFrame or Dataset as non-persistent and removes all of its blocks from memory and disk.
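To illustrate the persist/unpersist lifecycle end to end, here is a hedged sketch; the SparkSession setup and the sample data are assumptions added for a self-contained example, not from the original:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("persist-demo")
  .master("local[*]") // assumption: local mode for the sketch
  .getOrCreate()

// Placeholder data standing in for the CSV from the question.
val dataframe = spark.range(100).toDF("id")

// Mark for caching at MEMORY_AND_DISK; nothing is stored yet (persist is lazy).
val dataframePersist = dataframe.persist(StorageLevel.MEMORY_AND_DISK)

// The first action actually materializes the cached blocks.
dataframePersist.count()

// When done, free the memory and disk blocks.
dataframePersist.unpersist()
```

Note that persist() only *marks* the data for caching; an action such as count() or show() is what fills the cache.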
First cache it, as df.cache, then register it as df.createOrReplaceTempView("dfTEMP"). Now every time you query dfTEMP, such as val df1 = spark.sql("select * from dfTEMP"), you will read it from memory (the first action on df1 will actually cache it). Don't worry about persistence for now: if df does not fit into memory, it will spill the rest to disk.
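Putting the answer together, here is a minimal end-to-end sketch (the session setup is an assumption; the file name follows the question). This also likely explains the observed behavior: the SQL statement CACHE TABLE is eager, while cache()/persist() are lazy and only materialize on the first action, so you must run an action before the underlying file is moved:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val df = spark.read.option("header", true).csv("xxx.csv")
df.persist(StorageLevel.DISK_ONLY) // or df.cache() / MEMORY_AND_DISK
df.createOrReplaceTempView("dfTEMP")

// Force materialization with an action BEFORE the underlying file moves;
// until then the cache is only marked, not filled.
spark.sql("select count(*) from dfTEMP").show()

// Subsequent queries read from the persisted blocks, not from xxx.csv.
spark.sql("select * from dfTEMP").show()
```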