When running spark-shell, it creates a file derby.log and a folder metastore_db. How do I configure Spark to put these somewhere else?

For the derby log I've tried Getting rid of derby.log, like so:

$ spark-shell --driver-memory 10g --conf "spark.driver.extraJavaOptions=-Dderby.stream.info.file=/dev/null"

with a couple of different properties, but Spark ignores them.

Does anyone know how to get rid of these or specify a default directory for them?
A Hive metastore warehouse (aka spark-warehouse) is the directory where Spark SQL persists tables whereas a Hive metastore (aka metastore_db) is a relational database to manage the metadata of the persistent relational entities, e.g. databases, tables, columns, partitions.
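A minimal sketch of the difference (the paths and table name here are made up for illustration): after enableHiveSupport, table data lands under spark.sql.warehouse.dir, while the table's metadata is recorded in the Derby-backed metastore_db.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") // table data files go here
  .enableHiveSupport()                                       // metadata goes to metastore_db
  .getOrCreate()

spark.range(10).write.saveAsTable("demo") // data under /tmp/spark-warehouse/demo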
The use of the hive.metastore.warehouse.dir property is deprecated since Spark 2.0.0, see the docs.

As hinted by this answer, the real culprit for both the metastore_db directory and the derby.log file being created in every working subdirectory is the derby.system.home property defaulting to . (the current working directory).

Thus, a default location for both can be specified by adding the following line to spark-defaults.conf:

spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby

where /tmp/derby can be replaced by the directory of your choice.
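The same property should also work as a one-off on the command line, without editing spark-defaults.conf:

$ spark-shell --conf "spark.driver.extraJavaOptions=-Dderby.system.home=/tmp/derby"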
For spark-shell, to avoid having the metastore_db directory and to avoid doing it in code (since the context/session is already created and you won't stop it and recreate it with the new configuration each time), you have to set its location in a hive-site.xml file and copy this file into the Spark conf directory.

A sample hive-site.xml file that puts the metastore_db location in /tmp (refer to my answer here):
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/tmp/metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/tmp/</value>
    <description>location of default database for the warehouse</description>
  </property>
</configuration>
After that you can start your spark-shell as follows to get rid of derby.log as well:

$ spark-shell --conf "spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp"
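If the goal is to discard the log entirely (as attempted in the question), the same flag with /dev/null as the target should also work; note that the property is derby.stream.error.file, not derby.stream.info.file:

$ spark-shell --conf "spark.driver.extraJavaOptions=-Dderby.stream.error.file=/dev/null"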
Try setting derby.system.home to some other directory as a system property before firing up the spark shell. Derby will create new databases there. The default value for this property is . (the current directory).
Reference: https://db.apache.org/derby/integrate/plugin_help/properties.html
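For spark-shell, one way to set that system property before the driver JVM starts (using /tmp/derby purely as an example path) is the --driver-java-options flag:

$ spark-shell --driver-java-options "-Dderby.system.home=/tmp/derby"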
Use the spark.sql.warehouse.dir property (the hive.metastore.warehouse.dir property it replaces is deprecated since Spark 2.0.0). From the docs:

import java.io.File
import org.apache.spark.sql.SparkSession

// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()
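Note that spark.sql.warehouse.dir controls where table data (the spark-warehouse directory) is written; the location of metastore_db itself is still governed by derby.system.home or the hive-site.xml ConnectionURL shown above. The same setting can also be passed on the command line, e.g.:

$ spark-shell --conf spark.sql.warehouse.dir=/tmp/spark-warehouse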
For the derby log: Getting rid of derby.log could be the answer. In general, create a derby.properties file in your working directory with the following content:
derby.stream.error.file=/path/to/desired/log/file
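For example (the /tmp/derby.log path is just an illustration), from the directory where spark-shell is launched:

$ echo "derby.stream.error.file=/tmp/derby.log" > derby.properties
$ spark-shell

Derby looks for derby.properties in derby.system.home, which defaults to the current directory, so the file has to sit wherever the shell is started.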
For me, setting the Spark property didn't work, neither on the driver nor the executor. So, searching for this issue, I ended up setting the property on the JVM itself instead:

import org.apache.spark.sql.SparkSession

// Set the property on the JVM directly, before the session is created
System.setProperty("derby.system.home", "D:\\tmp\\derby")

val spark: SparkSession = SparkSession.builder
  .appName("UT session")
  .master("local[*]")
  .enableHiveSupport
  .getOrCreate

[...]

And that finally rid me of those annoying items.
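This works because in local mode the driver runs in the current JVM, so the property is in place before Derby boots; in a cluster deployment you would presumably still need spark.driver.extraJavaOptions to reach the driver JVM.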
If you are using Jupyter/JupyterHub/JupyterLab, or just setting this conf parameter inside Python, the following will work:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[*]")
        .set('spark.driver.extraJavaOptions', '-Dderby.system.home=/tmp/derby'))

sc = SparkContext(conf=conf)
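Note that this only works because the conf is supplied before the SparkContext is created: when launching from a plain Python process, the driver JVM is only started at that point, so spark.driver.extraJavaOptions can still be applied. Setting it on an already-running session has no effect.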