Spark SQL: how to cache sql query result without using rdd.cache()

Is there any way to cache a SQL query result without using rdd.cache()? For example:

output = sqlContext.sql("SELECT * From people")

We can use output.cache() to cache the result, but then we cannot run SQL queries against it.

So my question is: is there anything like sqlContext.cacheTable() to cache the result?

asked Jan 19 '15 by lwwwzh

People also ask

Which of the following is the correct ways to cache the data tables in spark SQL?

You should use sqlContext.cacheTable("table_name") in order to cache it, or alternatively use the CACHE TABLE table_name SQL statement.

How do I cache data in spark?

Caching methods in Spark:

DISK_ONLY: persist data on disk only, in serialized format.
MEMORY_ONLY: persist data in memory only, in deserialized format.
MEMORY_AND_DISK: persist data in memory; if enough memory is not available, evicted blocks are stored on disk.
OFF_HEAP: data is persisted in off-heap memory.

Can we cache table in spark?

Yes. The CACHE TABLE statement caches the named table, optionally lazily (only when it is first used, instead of immediately). It also accepts an OPTIONS clause with a storageLevel key and value pair; a warning is issued when a key other than storageLevel is used.


2 Answers

You should use sqlContext.cacheTable("table_name") in order to cache it, or alternatively use CACHE TABLE table_name SQL query.

Here's an example. I've got this file on HDFS:

1|Alex|[email protected]
2|Paul|[email protected]
3|John|[email protected]

Then the code in PySpark:

from pyspark.sql import Row

people = sc.textFile('hdfs://sparkdemo:8020/people.txt')
people_t = people.map(lambda x: x.split('|')).map(lambda x: Row(id=x[0], name=x[1], email=x[2]))
tbl = sqlContext.inferSchema(people_t)   # Spark 1.x API
tbl.registerTempTable('people')
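For reference, the split step can be simulated in plain Python without Spark; the names and example.com addresses below are made-up placeholders, not the file's real data:

```python
# Plain-Python sketch of what map(lambda x: x.split('|')) produces per line.
# The names/addresses are placeholders standing in for the HDFS file's contents.
lines = [
    "1|Alex|alex@example.com",
    "2|Paul|paul@example.com",
]
rows = [dict(zip(["id", "name", "email"], line.split("|"))) for line in lines]
print(rows[0])  # {'id': '1', 'name': 'Alex', 'email': 'alex@example.com'}
```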

Now we have a table and can query it:

sqlContext.sql('select * from people').collect()

To persist it, we have 3 options:

# 1st - using SQL
sqlContext.sql('CACHE TABLE people').collect()
# 2nd - using SQLContext
sqlContext.cacheTable('people')
sqlContext.sql('select count(*) from people').collect()     
# 3rd - using Spark cache underlying RDD
tbl.cache()
sqlContext.sql('select count(*) from people').collect()     

The 1st and 2nd options are preferred, as they cache the data in Spark SQL's optimized in-memory columnar format, while the 3rd caches it like any other RDD, in row-oriented fashion.

So going back to your question, here's one possible solution:

output = sqlContext.sql("SELECT * From people")
output.registerTempTable('people2')
sqlContext.cacheTable('people2')
sqlContext.sql("SELECT count(*) From people2").collect()
answered Sep 30 '22 by 0x0FFF


The following is most like using .cache for RDDs, and is helpful in Zeppelin or similar SQL-heavy environments:

CACHE TABLE CACHED_TABLE AS
SELECT $interesting_query

then you get cached reads both for subsequent uses of interesting_query and for all queries on CACHED_TABLE.

This answer builds on the accepted answer, but the power of using AS is what makes the call useful in more constrained SQL-only environments, where you cannot call .collect() or perform RDD/DataFrame operations at all.

answered Sep 30 '22 by Rick Moritz