How to integrate HIVE access into PySpark derived from pip and conda (not from a Spark distribution or package)

I build and programmatically use my PySpark environment from the ground up via conda and pip-installed pyspark (as I demonstrate Here), rather than using PySpark from the downloadable Spark distribution. As you can see in the first code snippet at the URL above, I accomplish this through (among other things) k/v conf-pairs in my SparkSession startup script. (By the way, this approach lets me work in various REPLs, IDEs, and Jupyter.)
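For context, here is a minimal sketch of that startup pattern (not the exact snippet from the URL above; the master, app name, and memory values are purely illustrative):

    # Minimal sketch: a SparkSession built programmatically from the
    # pip/conda-installed pyspark package, configured via k/v conf-pairs.
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = SparkConf()
    conf.set("spark.master", "local[*]")             # illustrative: local mode for REPL/IDE/Jupyter work
    conf.set("spark.app.name", "conda-pip-pyspark")  # illustrative app name
    conf.set("spark.driver.memory", "2g")            # illustrative value

    spark = SparkSession.builder.config(conf=conf).getOrCreate()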

However, with respect to configuring Spark support for accessing HIVE databases and metadata-stores, the manual says this:

Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.

By conf/ above they mean the conf/ directory of the Spark distribution package. But pyspark installed via pip and conda doesn't ship that directory, of course, so how might HIVE database and metastore support be plugged into Spark in that case?

I suspect this might be accommodated by specially-prefixed SparkConf K/V pairs of the form spark.hadoop.* (see here); and if so, I'd still need to determine which HADOOP / HIVE / CORE directives are needed. I guess I'll work that out by trial and error. :)

Note: .enableHiveSupport() has already been included.

I'll tinker with spark.hadoop.* K/V pairs, but if anyone knows how this is done offhand, please do let me know.
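Concretely, what I have in mind is something like the sketch below (the metastore URI and warehouse path are hypothetical placeholders, not values from my environment):

    # Hypothetical sketch: passing Hive/Hadoop client directives through
    # spark.hadoop.*-prefixed conf pairs, with Hive support enabled.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-enabled-pyspark")
        .config("spark.hadoop.hive.metastore.uris", "thrift://metastore-host:9083")  # hypothetical metastore host
        .config("spark.sql.warehouse.dir", "/user/hive/warehouse")                   # illustrative warehouse path
        .enableHiveSupport()
        .getOrCreate()
    )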

Thank you. :)

EDIT: After the solution was provided, I updated the content in the first URL above. It now integrates the SPARK_CONF_DIR and HADOOP_CONF_DIR environment variable approach discussed below.

asked Oct 17 '22 by NYCeyes

1 Answer

In this case I'd recommend the official configuration guide (emphasis mine):

If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that should be included on Spark’s classpath:

  • hdfs-site.xml, which provides default behaviors for the HDFS client.
  • core-site.xml, which sets the default filesystem name.

(...)

To make these files visible to Spark, set HADOOP_CONF_DIR in $SPARK_HOME/conf/spark-env.sh to a location containing the configuration files.

Additionally:

To specify a different configuration directory other than the default “SPARK_HOME/conf”, you can set SPARK_CONF_DIR. Spark will use the configuration files (spark-defaults.conf, spark-env.sh, log4j.properties, etc) from this directory.

So it is possible to place the desired configuration files in any directory accessible to your Spark installation, and SPARK_CONF_DIR and/or HADOOP_CONF_DIR can be set directly in your script using os.environ, before the SparkSession is created.
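For example, assuming the configuration files live under hypothetical /path/to/... directories, it could look like this:

    # Sketch: point Spark at arbitrary conf directories via environment
    # variables, set before the SparkSession (and its JVM) starts.
    # The paths are placeholders.
    import os
    from pyspark.sql import SparkSession

    os.environ["SPARK_CONF_DIR"] = "/path/to/spark-conf"    # spark-defaults.conf, spark-env.sh, ...
    os.environ["HADOOP_CONF_DIR"] = "/path/to/hadoop-conf"  # hive-site.xml, core-site.xml, hdfs-site.xml

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()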

Finally, most of the time there is no need for separate Hadoop configuration files at all, as Hadoop-specific properties can be set directly in the Spark configuration using the spark.hadoop.* prefix.
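For instance, a property that would normally live in core-site.xml can be passed through the prefix instead (the NameNode address below is a placeholder):

    # Sketch: a Hadoop property supplied via the spark.hadoop.* prefix
    # rather than a core-site.xml file on the classpath.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:8020")  # hypothetical NameNode
        .enableHiveSupport()
        .getOrCreate()
    )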

answered Nov 15 '22 by user10938362