I'm trying to write a unit test that relies on DataFrame.saveAsTable() (which is backed by a file system). I point the Hive warehouse parameter to a local disk location:
sql.sql(s"SET hive.metastore.warehouse.dir=file:///home/myusername/hive/warehouse")
By default, the metastore should run in Embedded Mode and therefore not require an external database.
But HiveContext seems to be ignoring this configuration, since I still get this error when calling saveAsTable():
MetaException(message:file:/user/hive/warehouse/users is not a directory or unable to create one)
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/user/hive/warehouse/users is not a directory or unable to create one)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:619)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:172)
at org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:224)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:54)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:54)
at org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:64)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1099)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1099)
at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1121)
at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1071)
at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1037)
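For reference, here is roughly what the test does, a minimal sketch with assumed names (the SparkConf settings, the sample data, and the users table name are placeholders inferred from the error above):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("saveAsTable-test"))
val sql = new HiveContext(sc)

// Attempt to redirect the warehouse after the HiveContext already exists
sql.sql(s"SET hive.metastore.warehouse.dir=file:///home/myusername/hive/warehouse")

// Fails: the metastore still points at the default /user/hive/warehouse
sql.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name").saveAsTable("users")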
This is quite annoying. Why is it still happening, and how do I fix it?
According to http://spark.apache.org/docs/latest/sql-programming-guide.html#sql
Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse.
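So on Spark 2.0+ the equivalent knob looks like this, a minimal sketch (the local path is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse") // placeholder location
  .enableHiveSupport()
  .getOrCreate()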
tl;dr Set hive.metastore.warehouse.dir while creating a SQLContext (or SparkSession).
The location of the default database for the Hive metastore warehouse is /user/hive/warehouse by default. It used to be set using the hive.metastore.warehouse.dir Hive-specific configuration property (in a Hadoop configuration).
It's been a while since you asked this question (it's Spark 2.3 days now), but that part has not changed since: if you use the sql method of SQLContext (or SparkSession these days), it's simply too late to change where Spark creates the metastore database. The underlying infrastructure has already been set up by then (that's what lets you use the SQLContext in the first place). The warehouse location has to be configured before the HiveContext / SQLContext / SparkSession initialization.
You should set hive.metastore.warehouse.dir while creating the SparkSession (or the SQLContext before Spark SQL 2.0) using config, and (very important) enable Hive support using enableHiveSupport.
config(key: String, value: String): Builder Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration.
enableHiveSupport(): Builder Enables Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.
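Putting the two together, a sketch of the fix (the warehouse path and table name are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  // Must be set before getOrCreate; on Spark 2.0+ prefer spark.sql.warehouse.dir
  .config("hive.metastore.warehouse.dir", "file:///tmp/hive-warehouse") // placeholder path
  .enableHiveSupport() // connects to a (here: embedded) Hive metastore
  .getOrCreate()

// Tables created now land under the configured warehouse directory
spark.range(3).write.saveAsTable("users")

One more gotcha to be aware of: getOrCreate reuses an already-running session, in which case the warehouse setting won't take effect either, which is another way to end up with the original problem.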
You could use a hive-site.xml configuration file or the spark.hadoop prefix, but I'm digressing (and it strongly depends on the current configuration).
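For illustration only, the spark.hadoop route could look like the sketch below; the spark.hadoop. prefix makes Spark copy the entry into the Hadoop configuration that Hive reads (treat this as an assumption about your setup, not a recommendation):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.hadoop.hive.metastore.warehouse.dir", "file:///tmp/hive-warehouse") // placeholder path
  .enableHiveSupport()
  .getOrCreate()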