Spark SQL: how to cache sql query result without using rdd.cache()

Is there any way to cache a SQL query result without using rdd.cache()? For example:

output = sqlContext.sql("SELECT * From people")

We can use output.cache() to cache the result, but then we cannot run SQL queries against it.

So my question is: is there anything like sqlContext.cacheTable() to cache the result?

asked Jan 19 '15 by lwwwzh

People also ask

Which of the following is the correct ways to cache the data tables in spark SQL?

You should use sqlContext.cacheTable("table_name") in order to cache it, or alternatively use the CACHE TABLE table_name SQL statement.

How do I cache data in spark?

Caching methods in Spark:

DISK_ONLY: persist data on disk only, in serialized format.
MEMORY_ONLY: persist data in memory only, in deserialized format.
MEMORY_AND_DISK: persist data in memory; if enough memory is not available, evicted blocks are stored on disk.
OFF_HEAP: data is persisted in off-heap memory.

Can we cache table in spark?

Yes. The CACHE TABLE statement caches the named table, optionally lazily (only when it is first used, instead of immediately). It also accepts an OPTIONS clause with a storageLevel key and value pair; a warning is issued when a key other than storageLevel is used.


2 Answers

You should use sqlContext.cacheTable("table_name") in order to cache it, or alternatively use CACHE TABLE table_name SQL query.

Here's an example. I've got this file on HDFS:

1|Alex|[email protected]
2|Paul|[email protected]
3|John|[email protected]

Then the code in PySpark:

from pyspark.sql import Row

people = sc.textFile('hdfs://sparkdemo:8020/people.txt')
people_t = people.map(lambda x: x.split('|')).map(lambda x: Row(id=x[0], name=x[1], email=x[2]))
tbl = sqlContext.inferSchema(people_t)   # Spark 1.x API
tbl.registerTempTable('people')
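For reference, the split step can be simulated in plain Python without Spark; the names and example.com addresses below are made-up placeholders, not the file's real data:

```python
# Plain-Python sketch of what map(lambda x: x.split('|')) produces per line.
# The names/addresses are placeholders standing in for the HDFS file's contents.
lines = [
    "1|Alex|alex@example.com",
    "2|Paul|paul@example.com",
]
rows = [dict(zip(["id", "name", "email"], line.split("|"))) for line in lines]
print(rows[0])  # {'id': '1', 'name': 'Alex', 'email': 'alex@example.com'}
```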

Now we have a table and can query it:

sqlContext.sql('select * from people').collect()

To persist it, we have 3 options:

# 1st - using SQL
sqlContext.sql('CACHE TABLE people').collect()
# 2nd - using SQLContext
sqlContext.cacheTable('people')
sqlContext.sql('select count(*) from people').collect()     
# 3rd - using Spark cache underlying RDD
tbl.cache()
sqlContext.sql('select count(*) from people').collect()     

The 1st and 2nd options are preferred, as they cache the data in Spark SQL's optimized in-memory columnar format, while the 3rd caches it like any other RDD, in row-oriented fashion.

So going back to your question, here's one possible solution:

output = sqlContext.sql("SELECT * From people")
output.registerTempTable('people2')
sqlContext.cacheTable('people2')
sqlContext.sql("SELECT count(*) From people2").collect()
answered Sep 30 '22 by 0x0FFF


The following is most like using .cache for RDDs, and is helpful in Zeppelin or similar SQL-heavy environments:

CACHE TABLE CACHED_TABLE AS
SELECT $interesting_query

then you get cached reads both for subsequent uses of interesting_query and for all queries on CACHED_TABLE.

This answer builds on the accepted answer, but the power of using AS is what makes the call useful in more constrained SQL-only environments, where you cannot call .collect() or perform RDD/DataFrame operations at all.

answered Sep 30 '22 by Rick Moritz