 

How to get rid of derby.log, metastore_db from Spark Shell

When running spark-shell it creates a file derby.log and a folder metastore_db. How do I configure spark to put these somewhere else?

For the derby log I've tried the approach from Getting rid of derby.log, like so: spark-shell --driver-memory 10g --conf "spark.driver.extraJavaOptions=-Dderby.stream.error.file=/dev/null", with a couple of different properties, but Spark ignores them.

Does anyone know how to get rid of these or specify a default directory for them?

asked Jul 14 '16 by Carlos Bribiescas

People also ask

Can I delete Derby log?

The bottom line: never delete or manipulate *any* files in a Derby database directory structure. Doing so will corrupt the database. The problem: some Derby users have discovered that deleting the files in the log subdirectory of the database will silence recovery errors and allow access to the database.

What is Derby log file?

The derby.log file is created when the Derby server is started. The Network Server then records the time and version. If a log file exists, it is overwritten, unless the property derby.infolog.append is set to true.

What is spark Metastore?

A Hive metastore warehouse (aka spark-warehouse) is the directory where Spark SQL persists tables whereas a Hive metastore (aka metastore_db) is a relational database to manage the metadata of the persistent relational entities, e.g. databases, tables, columns, partitions.


6 Answers

The hive.metastore.warehouse.dir property is deprecated since Spark 2.0.0 (replaced by spark.sql.warehouse.dir); see the docs.

As hinted by this answer, the real culprit for both the metastore_db directory and the derby.log file being created in every working directory is the derby.system.home property, which defaults to . (the current working directory).

Thus, a default location for both can be specified by adding the following line to spark-defaults.conf:

spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby

where /tmp/derby can be replaced by the directory of your choice.
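If you want to script this, here is a minimal sketch that appends the setting to spark-defaults.conf; the SPARK_HOME fallback path is only an example, adjust it for your installation:

```shell
# Append the Derby home setting to spark-defaults.conf.
# The /tmp/spark fallback below is an assumed example path.
CONF_FILE="${SPARK_HOME:-/tmp/spark}/conf/spark-defaults.conf"
mkdir -p "$(dirname "$CONF_FILE")"
echo 'spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby' >> "$CONF_FILE"
grep 'derby.system.home' "$CONF_FILE"
```

The same option can also be passed for a single session with spark-shell --conf "spark.driver.extraJavaOptions=-Dderby.system.home=/tmp/derby".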

answered Oct 02 '22 by hiryu


For spark-shell, to avoid having the metastore_db directory created, and to avoid doing it in code (since the context/session is already created and you won't stop and recreate it with the new configuration each time), you have to set its location in a hive-site.xml file and copy that file into Spark's conf directory.
A sample hive-site.xml file placing metastore_db in /tmp (refer to my answer here):

<configuration>
   <property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:derby:;databaseName=/tmp/metastore_db;create=true</value>
     <description>JDBC connect string for a JDBC metastore</description>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionDriverName</name>
     <value>org.apache.derby.jdbc.EmbeddedDriver</value>
     <description>Driver class name for a JDBC metastore</description>
   </property>
   <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/tmp/</value>
      <description>location of default database for the warehouse</description>
   </property>
</configuration>
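As a sketch, the file can also be generated from the shell; only the ConnectionURL property is shown here (see the full file above), and the SPARK_HOME fallback path is an assumed example:

```shell
# Write a minimal hive-site.xml into Spark's conf directory.
# The /tmp/spark fallback is only an example; adjust for your install.
CONF_DIR="${SPARK_HOME:-/tmp/spark}/conf"
mkdir -p "$CONF_DIR"
cat > "$CONF_DIR/hive-site.xml" <<'EOF'
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/tmp/metastore_db;create=true</value>
  </property>
</configuration>
EOF
```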

After that, you can start spark-shell as follows to get rid of derby.log as well (note that derby.stream.error.file expects a file path, not a directory):

$ spark-shell --conf "spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp/derby.log"
answered Oct 02 '22 by user1314742


Try setting derby.system.home to some other directory as a system property before firing up the spark shell. Derby will create new databases there. The default value for this property is . (the current working directory).

Reference: https://db.apache.org/derby/integrate/plugin_help/properties.html

answered Sep 30 '22 by Bill


Use the spark.sql.warehouse.dir property (the replacement for the deprecated hive.metastore.warehouse.dir). From the docs:

import org.apache.spark.sql.SparkSession

val warehouseLocation = "/tmp/spark-warehouse" // the directory of your choice

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

For the derby log: Getting rid of derby.log could be the answer. In general, create a derby.properties file in your working directory with the following content:

derby.stream.error.file=/path/to/desired/log/file
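As a sketch, creating that file from the shell in the directory where spark-shell will be launched (the log path here is just an example):

```shell
# Drop a derby.properties file into the current working directory so the
# embedded Derby instance writes its log to the given file instead.
echo 'derby.stream.error.file=/tmp/derby.log' > derby.properties
cat derby.properties
```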
answered Oct 02 '22 by user6022341


For me, setting the Spark property didn't work, on either the driver or the executors. So, searching for this issue, I ended up setting the property on the JVM directly instead, before creating the session:

import org.apache.spark.sql.SparkSession

// Must be set before the SparkSession (and its embedded Derby) is created.
System.setProperty("derby.system.home", "D:\\tmp\\derby")

val spark: SparkSession = SparkSession.builder
    .appName("UT session")
    .master("local[*]")
    .enableHiveSupport()
    .getOrCreate()

[...]

And that finally got rid of those annoying items.

answered Oct 04 '22 by Adrien Brunelat


If you are using Jupyter/JupyterHub/JupyterLab, or just setting this conf parameter inside Python, the following will work:

from pyspark import SparkConf, SparkContext

# The Derby home must be set before the SparkContext is created.
conf = (SparkConf()
    .setMaster("local[*]")
    .set('spark.driver.extraJavaOptions', '-Dderby.system.home=/tmp/derby')
)

sc = SparkContext(conf=conf)
answered Oct 02 '22 by kennyut