 

Drop a Hive table through SparkSQL

I want to drop a Hive table through SparkSQL.

I'm on a cluster with Hadoop 2.6, Hive 2.0, Spark 1.6 and Spark 2.0 installed. I tried the following code in the PySpark shell and in a spark-submit job, with both versions:

sqlContext.sql('drop table test')  # Spark 1.6
spark.sql('drop table test')       # Spark 2.0

Both lines work fine in the PySpark shell: I can see from the Hive CLI that the test table no longer exists.

However, when the code is put in a Python file and submitted to the cluster with spark-submit, it never takes effect.

Spark 2.0 even raises an error:

pyspark.sql.utils.AnalysisException: u"Table to drop '`try`' does not exist;"

I have copied hive-site.xml into Spark's conf directory.

What is the correct way to drop a Hive table through SparkSQL?

Update:

I compared the Spark environment between the spark-shell and the job submitted with the following command:

spark-submit --master yarn --deploy-mode cluster try_spark_sql.py

In the spark-shell environment, I can see that spark.sql.catalogImplementation is set to hive.

In the job submitted with the command above, the environment doesn't contain spark.sql.catalogImplementation. I tried setting it with the following code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonSQL") \
    .config("spark.sql.catalogImplementation", "hive") \
    .getOrCreate()

But it doesn't have any effect on the environment.
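(For reference, the catalog implementation is fixed when the session is first created, and in Spark 2.x the documented way to request the Hive catalog from code is enableHiveSupport() on the builder. A minimal sketch, assuming hive-site.xml is visible to the driver; the app name is just illustrative:)

from pyspark.sql import SparkSession

# enableHiveSupport() sets spark.sql.catalogImplementation to "hive"
# before the session exists; it cannot be switched on an already
# created session.
spark = SparkSession.builder \
    .appName("PythonSQL") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql('DROP TABLE IF EXISTS test')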

One workaround I found is submitting the job in client mode instead of cluster mode; then the Hive table is dropped successfully.
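(That workaround fits the usual explanation: in cluster mode the driver runs on a cluster node, so a hive-site.xml present only on the submitting machine never reaches it. A possible fix, not verified on this setup and with an illustrative path, is shipping the file with the job:)

spark-submit --master yarn --deploy-mode cluster --files /path/to/hive-site.xml try_spark_sql.py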

asked Sep 30 '16 by Heyang Wang


3 Answers

When you use the PySpark shell, Spark has built-in Hive support: the default SQLContext implementation (the one available as sqlContext) is a HiveContext.

In your application you are probably using a plain SQLContext, which doesn't provide Hive capabilities.

Do the following instead; it should work:

from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql('drop table test')
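Note that sc only exists automatically in the shell; in a file run through spark-submit you have to create it yourself. A minimal self-contained sketch for Spark 1.6 (the app name is illustrative):

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# spark-submit scripts do not get the shell's pre-created `sc`,
# so build the SparkContext explicitly.
sc = SparkContext(conf=SparkConf().setAppName("DropHiveTable"))
sqlContext = HiveContext(sc)  # HiveContext enables Hive DDL such as DROP TABLE
sqlContext.sql('DROP TABLE IF EXISTS test')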
answered Oct 28 '22 by Arunakiran Nulu


From the error it is quite clear that the table does not exist in the current database you are using. You need to change your SQL as follows:

sqlContext.sql('DROP TABLE IF EXISTS db_name.table_name')  # Spark 1.6
spark.sql('DROP TABLE IF EXISTS db_name.table_name')       # Spark 2.0

Any SQL query executed through HiveContext in Spark 1.6, or SparkSession in Spark 2.0, searches for the table in Hive's default database. You need to either run spark.sql("USE db_name") first and then execute the table-specific SQL, or prefix the table name with its database (e.g. testDB.testTable), as shown above.
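Concretely, the two options look like this (db_name and table_name are placeholder names):

# Option 1: switch the current database, then drop by bare name
spark.sql('USE db_name')
spark.sql('DROP TABLE IF EXISTS table_name')

# Option 2: qualify the table name with its database in one statement
spark.sql('DROP TABLE IF EXISTS db_name.table_name')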

answered Oct 28 '22 by Yayati Sule


A slight change to the above worked for me:

spark.sql("DROP TABLE IF EXIST table_name") 
answered Oct 28 '22 by CMahler