
Spark in embedded mode - /user/hive/warehouse not found

I'm using Apache Spark in embedded local mode. I have all the dependencies included in my pom.xml, all at the same version (spark-core_2.10, spark-sql_2.10, and spark-hive_2.10).

I just want to run a HiveQL query to create a table (stored as Parquet).

Running the following (rather simple) code:

import java.io.IOException;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

public class App {
    public static void main(String[] args) throws IOException, ClassNotFoundException {

        SparkConf sparkConf = new SparkConf()
                .setAppName("JavaSparkSQL")
                .setMaster("local[2]")
                .set("spark.executor.memory", "1g");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        HiveContext sqlContext = new HiveContext(ctx.sc());

        String createQuery = "CREATE TABLE IF NOT EXISTS Test (id int, name string) STORED AS PARQUET";
        sqlContext.sql(createQuery);
    }
}

...is returning the following exception:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:file:/user/hive/warehouse/test is not a directory or unable to create one)

I can see the metastore_db folder created in the root of the project.

I searched around, but the solutions I found didn't help; most of them were not for embedded mode.

  • One suggestion was to check the permissions; I'm using the same user for everything.
  • Another was to create the folder manually in HDFS; I did, and I can navigate to /user/hive/warehouse/test.
  • A third was to set the metastore warehouse directory manually by running sqlContext.sql("SET hive.metastore.warehouse.dir=hdfs://localhost:9000/user/hive/warehouse");.

I'm running out of ideas right now. Can anyone suggest something else?

asked Aug 13 '15 by Infogeek




2 Answers

Just in case this helps anybody else in the future: I was attempting to write some unit tests against Spark code that uses a HiveContext. I found that, in order to change the path where files are written during the tests, I needed to call hiveContext.setConf. I also tried the same approach as the OP, issuing a SET query, but that didn't work. The following does:

hiveContext.setConf("hive.metastore.warehouse.dir",
  "file:///custom/path/to/hive/warehouse")

And just to make this a tad more useful, I specifically set this path to a location my code had access to:

hiveContext.setConf("hive.metastore.warehouse.dir",
  getClass.getResource(".").toString)

With this, I've been able to write unit tests against my code making use of hive queries and the Spark API.
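For a Java project like the OP's, a roughly equivalent setup might look like the sketch below. This is not from the answer: the class name and the /tmp/hive/warehouse path are illustrative placeholders; the key point is calling setConf before running any DDL.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

public class HiveWarehouseTestSetup {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("HiveWarehouseTestSetup")
                .setMaster("local[2]");
        JavaSparkContext ctx = new JavaSparkContext(conf);
        HiveContext hiveContext = new HiveContext(ctx.sc());

        // Redirect the warehouse to a writable local directory *before*
        // issuing any DDL; "/tmp/hive/warehouse" is just a placeholder.
        hiveContext.setConf("hive.metastore.warehouse.dir",
                "file:///tmp/hive/warehouse");

        hiveContext.sql(
                "CREATE TABLE IF NOT EXISTS Test (id int, name string) STORED AS PARQUET");
    }
}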

answered Nov 05 '22 by Steven Bakhtiari

Because you're running in local embedded mode, HDFS is not being considered. This is why the error says file:/user/hive/warehouse/test rather than hdfs://localhost:9000/user/hive/warehouse/test. It expects /user/hive/warehouse/test to exist on your local machine. Try creating it locally.
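One way to do that from the application itself is sketched below. Note this is just an illustration: creating a directory directly under / usually needs elevated permissions on Linux, so in practice you may have to create it once as root and chown it to the user running Spark.

import java.io.File;

public class CreateWarehouseDir {
    public static void main(String[] args) {
        // The local path Hive resolves in embedded mode (file:/user/hive/warehouse).
        File warehouse = new File("/user/hive/warehouse");
        if (!warehouse.exists() && !warehouse.mkdirs()) {
            throw new IllegalStateException("Could not create " + warehouse);
        }
    }
}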

answered Nov 05 '22 by mattinbits