I build and programmatically use my PySpark environment from the ground up via conda and pip pyspark (as I demonstrate here), rather than using the PySpark bundled with the downloadable Spark distribution. As you can see in the first code snippet at the URL above, I accomplish this through (among other things) k/v conf pairs in my SparkSession startup script. (By the way, this approach lets me work in various REPLs, IDEs and Jupyter.)
However, with respect to configuring Spark support for accessing HIVE databases and metadata-stores, the manual says this:
Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.
By conf/ above they mean the conf/ directory in the Spark distribution package. But pyspark via pip and conda doesn't have that directory, of course, so how might HIVE database and metastore support be plugged into Spark in that case?
I suspect this might be accommodated by specially-prefixed SparkConf K/V pairs of the form spark.hadoop.* (see here); and if so, I'd still need to determine which HADOOP / HIVE / CORE directives are needed. I guess I'll trial-and-error that. :)
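For illustration, this is roughly what I have in mind — a minimal sketch, assuming a remote Hive metastore reachable at a hypothetical thrift://metastore-host:9083 and a hypothetical HDFS namenode; the choice of properties to pass under the spark.hadoop. prefix is my guess, not something confirmed by the docs:

```python
from pyspark.sql import SparkSession

# Sketch: pass Hive/Hadoop settings as spark.hadoop.*-prefixed k/v conf
# pairs instead of relying on conf/hive-site.xml, core-site.xml, etc.
spark = (SparkSession.builder
         .appName("hive-via-conf-pairs")
         # Guess: forwarded to the Hadoop/Hive config as hive.metastore.uris
         .config("spark.hadoop.hive.metastore.uris",
                 "thrift://metastore-host:9083")      # hypothetical host/port
         # Guess: forwarded as fs.defaultFS (normally set in core-site.xml)
         .config("spark.hadoop.fs.defaultFS",
                 "hdfs://namenode-host:8020")          # hypothetical namenode
         .enableHiveSupport()   # already part of my startup script (see note)
         .getOrCreate())

spark.sql("SHOW DATABASES").show()
```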
Note: .enableHiveSupport() has already been included.
I'll tinker with spark.hadoop.* K/V pairs, but if anyone knows how this is done offhand, please do let me know.
Thank you. :)
EDIT: After the solution was provided, I updated the content at the first URL above. It now integrates the SPARK_CONF_DIR and HADOOP_CONF_DIR environment-variable approach discussed below.
In this case I'd recommend the official configuration guide (emphasis mine):
If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that should be included on Spark’s classpath:
- hdfs-site.xml, which provides default behaviors for the HDFS client.
- core-site.xml, which sets the default filesystem name.
(...)
To make these files visible to Spark, set HADOOP_CONF_DIR in $SPARK_HOME/conf/spark-env.sh to a location containing the configuration files.
Additionally:
To specify a different configuration directory other than the default “SPARK_HOME/conf”, you can set SPARK_CONF_DIR. Spark will use the configuration files (spark-defaults.conf, spark-env.sh, log4j.properties, etc.) from this directory.
So it is possible to use an arbitrary directory accessible to your Spark installation to place the desired configuration files, and SPARK_CONF_DIR and / or HADOOP_CONF_DIR can easily be set directly in your script using os.environ.
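For example, a minimal sketch of that approach, assuming your configuration files live in a placeholder /path/to/conf directory (the variables must be set before the first SparkSession, and hence the JVM, is started):

```python
import os
from pyspark.sql import SparkSession

# Point Spark at an arbitrary directory holding spark-defaults.conf,
# spark-env.sh, hive-site.xml, core-site.xml, hdfs-site.xml, etc.
# NOTE: /path/to/conf is a placeholder; use your own location.
os.environ["SPARK_CONF_DIR"] = "/path/to/conf"
os.environ["HADOOP_CONF_DIR"] = "/path/to/conf"

# Set the environment variables above before creating the session.
spark = (SparkSession.builder
         .appName("hive-via-conf-dir")
         .enableHiveSupport()
         .getOrCreate())
```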
Finally, most of the time there is no need for separate Hadoop configuration files at all, as Hadoop-specific properties can be set directly in the Spark configuration using the spark.hadoop.* prefix.
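For instance, a property that would normally live in core-site.xml can be passed through a SparkConf object — a sketch only, with a placeholder namenode address:

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Hadoop property fs.defaultFS (normally in core-site.xml), passed to the
# Hadoop configuration via the spark.hadoop. prefix; address is a placeholder.
conf = SparkConf().set("spark.hadoop.fs.defaultFS", "hdfs://namenode-host:8020")

spark = (SparkSession.builder
         .config(conf=conf)
         .enableHiveSupport()
         .getOrCreate())
```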