
How to refresh a table and do it concurrently?

I'm using Spark Streaming 2.1. I'd like to periodically refresh a cached table (loaded through a Spark-provided data source such as Parquet or MySQL, or through a user-defined data source).

  1. How do I refresh the table?

    Suppose I have some table loaded by


    and it is also cached by

    spark.sql("cache table my_table")

    Is the following code enough to refresh the table, so that it is automatically cached again the next time it is loaded?

    spark.sql("refresh table my_table")

    Or do I have to do it manually with

    spark.table("my_table").unpersist()
    spark.read.format("").load().createOrReplaceTempView("my_table")
    spark.sql("cache table my_table")

  2. Is it safe to refresh the table concurrently?

    By concurrently I mean using a ScheduledThreadPoolExecutor to do the refresh work on a thread other than the main thread.

    What happens if Spark is using the cached table when I call refresh on it?
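
To make the setup in point 2 concrete, here is a minimal sketch of such a background refresh. The names `ScheduledRefreshSketch`, `refreshMyTable`, `startScheduler` and the 100 ms period are hypothetical, and the placeholder body only counts invocations; in the real job it would issue the refresh call from point 1. The sketch shows the scheduling pattern only and does not by itself show whether refreshing while a query is running is safe.

```scala
import java.util.concurrent.{ScheduledThreadPoolExecutor, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object ScheduledRefreshSketch {
  // Counts how often the placeholder refresh has run (illustration only).
  val refreshCount = new AtomicInteger(0)

  // Hypothetical stand-in for the real refresh action; in the actual job this
  // would be something like spark.sql("refresh table my_table").
  def refreshMyTable(): Unit = {
    refreshCount.incrementAndGet()
  }

  // Starts a single background thread that runs the refresh at a fixed rate.
  def startScheduler(periodMillis: Long): ScheduledThreadPoolExecutor = {
    val scheduler = new ScheduledThreadPoolExecutor(1)
    val task = new Runnable {
      override def run(): Unit = {
        // An exception escaping run() would suppress all later executions,
        // so failures are caught and reported here.
        try refreshMyTable()
        catch { case e: Exception => e.printStackTrace() }
      }
    }
    scheduler.scheduleAtFixedRate(task, 0L, periodMillis, TimeUnit.MILLISECONDS)
    scheduler
  }

  def main(args: Array[String]): Unit = {
    val scheduler = startScheduler(100L)
    Thread.sleep(350L) // the main thread would run the streaming job here
    scheduler.shutdown()
    scheduler.awaitTermination(1, TimeUnit.SECONDS)
    println(s"refresh ran ${refreshCount.get()} times")
  }
}
```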

宇宙人 asked Aug 22 '17 04:08



1 Answer

Spark 2.2.0 provides a way to refresh the metadata of a table after it has been updated by Hive or other external tools.

You can do this through the table-refresh API, which updates the metadata for that table so that it stays consistent.
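
A minimal sketch of such a call, assuming the API meant here is `spark.catalog.refreshTable` and that the table is registered as `my_table`, as in the question. The Spark API documentation describes this method as invalidating and refreshing the cached data and metadata of the given table. The documentation for the 2.x line also says a cached table is dropped from the cache and re-cached lazily, which is worth verifying against the version in use. The local session and the `spark.range` stand-in table are illustrative assumptions, not part of the original setup.

```scala
import org.apache.spark.sql.SparkSession

object RefreshTableSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would reuse its own SparkSession.
    val spark = SparkSession.builder()
      .appName("refresh-table-sketch")
      .master("local[*]")
      .getOrCreate()

    // Stand-in for the table from the question, which is really loaded from a data source.
    spark.range(10).createOrReplaceTempView("my_table")
    spark.sql("cache table my_table")

    // Invalidate and refresh the cached data and metadata of the table.
    spark.catalog.refreshTable("my_table")

    // Check whether the table is still marked as cached after the refresh.
    println(s"cached after refresh: ${spark.catalog.isCached("my_table")}")

    spark.stop()
  }
}
```

The SQL form used in the question, `spark.sql("refresh table my_table")`, is the statement-level counterpart of this call.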

Ganesh answered Oct 10 '22 01:10
