Is there any way to cache a SQL query result without using rdd.cache()? For example:
output = sqlContext.sql("SELECT * From people")
We can use output.cache() to cache the result, but then we cannot query it with SQL.
So I want to ask: is there anything like sqlContext.cacheTable() to cache the result?
Caching methods (storage levels) in Spark:
DISK_ONLY: Persist data on disk only, in serialized format.
MEMORY_ONLY: Persist data in memory only, in deserialized format.
MEMORY_AND_DISK: Persist data in memory; if not enough memory is available, evicted blocks are stored on disk.
OFF_HEAP: Persist data in off-heap memory.
Parameters of the CACHE TABLE statement:
LAZY: Only cache the table when it is first used, instead of immediately.
table_name: The name of the table to be cached.
OPTIONS: Clause with a storageLevel key and value pair. A warning is issued when a key other than storageLevel is used.
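To make the clauses above concrete, here is a minimal sketch of how they combine into a single statement. The helper name build_cache_table_sql is hypothetical (it is not part of Spark), and the OPTIONS form assumes a Spark version whose CACHE TABLE accepts a storageLevel option.

```python
def build_cache_table_sql(table_name, lazy=False, storage_level=None):
    """Build a CACHE TABLE statement string (hypothetical helper, not a Spark API)."""
    parts = ["CACHE"]
    if lazy:
        # LAZY defers caching until the table is first used
        parts.append("LAZY")
    parts.append("TABLE " + table_name)
    if storage_level is not None:
        # storageLevel is the only key the OPTIONS clause recognizes
        parts.append("OPTIONS ('storageLevel' '%s')" % storage_level)
    return " ".join(parts)


print(build_cache_table_sql("people"))
print(build_cache_table_sql("people", lazy=True, storage_level="DISK_ONLY"))
```

The resulting string would then be passed to sqlContext.sql(...) or run directly in a SQL cell.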
You should use sqlContext.cacheTable("table_name") in order to cache it, or alternatively use the CACHE TABLE table_name SQL query.
Here's an example. I've got this file on HDFS:
1|Alex|[email protected]
2|Paul|[email protected]
3|John|[email protected]
Then the code in PySpark:
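from pyspark.sql import Row
# sc and sqlContext are the objects predefined in the PySpark shell
people = sc.textFile('hdfs://sparkdemo:8020/people.txt')
# Split each pipe-delimited line and wrap the fields in a Row
people_t = people.map(lambda x: x.split('|')).map(lambda x: Row(id=x[0], name=x[1], email=x[2]))
# inferSchema is the Spark 1.0-1.2 API; Spark 1.3+ deprecates it in favor of sqlContext.createDataFrame(people_t)
tbl = sqlContext.inferSchema(people_t)
tbl.registerTempTable('people')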
Now we have a table and can query it:
sqlContext.sql('select * from people').collect()
To persist it, we have 3 options:
# 1st - using SQL
sqlContext.sql('CACHE TABLE people').collect()
# 2nd - using SQLContext
sqlContext.cacheTable('people')
sqlContext.sql('select count(*) from people').collect()
# 3rd - caching the underlying RDD with Spark's cache()
tbl.cache()
sqlContext.sql('select count(*) from people').collect()
The 1st and 2nd options are preferred, as they cache the data in an optimized in-memory columnar format, while the 3rd caches it just like any other RDD, in row-oriented fashion.
So going back to your question, here's one possible solution:
output = sqlContext.sql("SELECT * From people")
output.registerTempTable('people2')
sqlContext.cacheTable('people2')
sqlContext.sql("SELECT count(*) From people2").collect()
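On Spark 2.0 and later the same steps are usually written against a SparkSession. The sketch below assumes spark is a pyspark.sql.SparkSession; the function name cache_query_result is hypothetical and only wraps the four steps of the solution above.

```python
def cache_query_result(spark, query, view_name):
    """Run a query, expose it as a temp view, and cache that view.

    `spark` is assumed to be a pyspark.sql.SparkSession (Spark 2.0+).
    """
    output = spark.sql(query)
    # createOrReplaceTempView replaces the older registerTempTable
    output.createOrReplaceTempView(view_name)
    # cacheTable only marks the view for caching; nothing is stored yet
    spark.catalog.cacheTable(view_name)
    # Run an action so the cache is actually populated
    spark.sql("SELECT count(*) FROM " + view_name).collect()
    return output


# Example call, assuming a 'people' table is already registered:
# cache_query_result(spark, "SELECT * FROM people", "people2")
```

Afterwards people2 can be queried with plain SQL, and spark.catalog.isCached("people2") reports whether the view is marked as cached.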
The following is most like using .cache() for RDDs, and is helpful in Zeppelin or similar SQL-heavy environments:
CACHE TABLE CACHED_TABLE AS
SELECT $interesting_query
Then you get cached reads both for subsequent usages of interesting_query and for all queries on CACHED_TABLE.
This answer is based on the accepted answer, but the power of using AS is what really made the call useful in the more constrained SQL-only environments, where you cannot .collect() or do RDD/DataFrame operations in any way.
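When PySpark is available, the same CACHE TABLE ... AS statement can be issued through spark.sql and paired with UNCACHE TABLE so the memory is released afterwards. This is only a sketch: the context manager cached_table is a hypothetical helper, and spark is assumed to be a SparkSession (a SQLContext exposes the same sql method).

```python
from contextlib import contextmanager


@contextmanager
def cached_table(spark, table_name, query):
    """Cache the result of `query` under `table_name` for the duration of a block."""
    # Creates a temporary view named table_name and caches the query result
    spark.sql("CACHE TABLE %s AS %s" % (table_name, query))
    try:
        yield table_name
    finally:
        # Drops the cached data; the temporary view itself remains defined
        spark.sql("UNCACHE TABLE " + table_name)


# Example use, assuming a 'people' table is already registered:
# with cached_table(spark, "cached_people", "SELECT * FROM people") as t:
#     spark.sql("SELECT count(*) FROM " + t).collect()
```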