
Spark persist temp view

I'm trying to persist a temp view so that I can query it again via SQL:

val df = spark.sqlContext.read.option("header", true).csv("xxx.csv")
df.createOrReplaceTempView("xxx")

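I then try to persist/cache it in one of these ways:

df.cache()                                // or
spark.sqlContext.cacheTable("xxx")        // or
df.persist(StorageLevel.MEMORY_AND_DISK)  // or (with org.apache.spark.storage.StorageLevel imported)
spark.sql("CACHE TABLE xxx")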

Then I move the underlying xxx.csv and run:

spark.sql("select * from xxx")

At that point, I find that only CACHE TABLE xxx has stored a copy. What am I doing wrong? How can I persist a queryable view/table, e.g. with DISK_ONLY?

darnok asked May 18 '17 11:05


1 Answer

First cache it with df.cache(), then register it with df.createOrReplaceTempView("dfTEMP"). From then on, every time you query dfTEMP, e.g. val df1 = spark.sql("select * from dfTEMP"), you will read it from memory (the first action on df1 is what actually caches it). Do not worry about the storage level for now: if df does not fit into memory, it will spill the rest to disk.
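The behaviour in the question is consistent with caching being lazy: df.cache(), df.persist(...) and cacheTable only mark the data for caching, and nothing is stored until an action runs, whereas the SQL statement CACHE TABLE is eager by default. Below is a minimal sketch of the steps above using DISK_ONLY. It assumes a local SparkSession and a throwaway CSV file standing in for xxx.csv; the object name, method name, and sample rows are hypothetical and only there to make the example self-contained.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistTempViewSketch {

  // Returns the number of rows read back from the view after the CSV is gone.
  def rowsAfterDelete(): Long = {
    // Assumption: a local session, just for illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("persist-temp-view-sketch")
      .getOrCreate()

    try {
      // Hypothetical stand-in for xxx.csv with two data rows.
      val csv = Files.createTempFile("xxx", ".csv")
      Files.write(csv, "id,name\n1,a\n2,b\n".getBytes("UTF-8"))

      val df = spark.read.option("header", "true").csv(csv.toString)

      // Mark the DataFrame for caching on disk only. This alone stores nothing.
      df.persist(StorageLevel.DISK_ONLY)

      // Run an action so the cached blocks are actually written.
      df.count()

      // Register the view; queries against it share df's plan and thus its cache.
      df.createOrReplaceTempView("xxx")

      // Simulate moving the underlying file away.
      Files.delete(csv)

      // This query should now be served from the persisted copy.
      spark.sql("select * from xxx").count()
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(s"rows read after deleting the file: ${rowsAfterDelete()}")
}
```

The key step is the df.count() call before the file is moved; without an action in between, the persist call has not copied anything yet. If you prefer to stay in SQL, CACHE TABLE xxx does the materialization for you, and Spark 3.0 and later also accept a storage level there, as in CACHE TABLE xxx OPTIONS ('storageLevel' 'DISK_ONLY'). Note that a persisted copy is still only a cache: if blocks are evicted or an executor is lost, Spark will try to recompute them from the original file.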

elcomendante answered Sep 29 '22 07:09