 

Why are there many spark-warehouse folders created?

I installed Hadoop 2.8.1 on Ubuntu and then installed spark-2.2.0-bin-hadoop2.7 on top of it. I used spark-shell to create some tables, and then used beeline to create tables as well. I observed that three different folders named spark-warehouse were created:

1- spark-2.2.0-bin-hadoop2.7/spark-warehouse

2- spark-2.2.0-bin-hadoop2.7/bin/spark-warehouse

3- spark-2.2.0-bin-hadoop2.7/sbin/spark-warehouse

What exactly is spark-warehouse and why is it created multiple times? Sometimes my spark-shell and beeline show different databases and tables and sometimes they show the same ones. I don't understand what is happening.

Furthermore, I did not install Hive, but I am still able to use beeline and can also access the databases through a Java program. How did Hive end up on my machine? Please help me; I am new to Spark and installed it by following online tutorials.

Below is the Java code I was using to connect to Apache Spark through JDBC:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class SparkJdbcClient {
        private static String driverName = "org.apache.hive.jdbc.HiveDriver";

        public static void main(String[] args) throws SQLException {
            // Load the Hive JDBC driver before opening the connection.
            try {
                Class.forName(driverName);
            } catch (ClassNotFoundException e) {
                e.printStackTrace();
                System.exit(1);
            }
            Connection con = DriverManager.getConnection(
                    "jdbc:hive2://10.171.0.117:10000/default", "", "");
            Statement stmt = con.createStatement();
            // ... run queries with stmt here ...
            stmt.close();
            con.close();
        }
    }
ABC asked Aug 22 '17 13:08


2 Answers

What exactly is spark-warehouse and why is it created multiple times?

Unless configured otherwise, Spark creates an internal Derby database named metastore_db, along with a derby.log. It looks like you haven't changed that.

This is the default behavior, as pointed out in the documentation:

When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started
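If you want the warehouse to live in one fixed place no matter where you launch Spark from, you can pin spark.sql.warehouse.dir in conf/spark-defaults.conf. A minimal sketch of such a config fragment (the /user/spark/warehouse path is only an example, not a required location):

```
# conf/spark-defaults.conf
# Pin the warehouse so it no longer depends on the current working directory.
spark.sql.warehouse.dir    /user/spark/warehouse
```

The same property can also be passed per invocation, e.g. with `--conf spark.sql.warehouse.dir=...` on the spark-shell or spark-submit command line.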

Sometimes my spark-shell and beeline show different databases and tables and sometimes they show the same ones

You're starting those commands from different folders, so what you see is confined to each current working directory.

I used beeline and created tables... How did Hive end up on my machine?

It didn't. You're probably connecting either to the Spark Thrift Server (which is fully compatible with the HiveServer2 protocol) or to the Derby database mentioned above, or you actually do have a HiveServer2 instance sitting at 10.171.0.117.

Anyway, the JDBC connection is not required here. You can use the SparkSession.sql function directly.
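For instance, a minimal sketch of querying through SparkSession instead of JDBC (this assumes the Spark SQL dependencies are on the classpath and that the application is submitted with spark-submit; it is an illustration, not the only way to do it):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DirectSql {
    public static void main(String[] args) {
        // enableHiveSupport() makes Spark use the (Derby-backed) metastore
        // and the spark-warehouse directory discussed above.
        SparkSession spark = SparkSession.builder()
                .appName("direct-sql-example")
                .enableHiveSupport()
                .getOrCreate();

        // Query directly, with no HiveServer2 / JDBC round trip.
        Dataset<Row> tables = spark.sql("SHOW TABLES");
        tables.show();

        spark.close();
    }
}
```

Because this runs inside the Spark application itself, it sees whatever metastore and warehouse directory that application was started with, which also makes the "different databases in different shells" symptom easier to reason about.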

OneCricketeer answered Dec 07 '22 21:12


In standalone mode, Spark creates the metastore in the directory from which it was launched. This is explained here: https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

So you should either set spark.sql.warehouse.dir, or simply make sure you always start your Spark job from the same directory (run bin/spark-shell instead of cd bin; ./spark-shell, etc.).

FurryMachine answered Dec 07 '22 20:12