Pyspark append executor environment variable

Is it possible to append a value to the PYTHONPATH of a worker in Spark?

I know it is possible to go to each worker node and configure the spark-env.sh file, but I want a more flexible approach.

I am trying to use the setExecutorEnv method, but with no success:

from pyspark import SparkConf

conf = SparkConf().setMaster("spark://192.168.10.11:7077") \
                  .setAppName("myname") \
                  .set("spark.cassandra.connection.host", "192.168.10.11") \
                  .setExecutorEnv('PYTHONPATH', '$PYTHONPATH:/custom_dir_that_I_want_to_append/')

It creates a pythonpath environment variable on each executor, forces the name to lower case, and does not interpret $PYTHONPATH to append the value.

I end up with two different environment variables:

pythonpath  :  $PYTHONPATH:/custom_dir_that_I_want_to_append
PYTHONPATH  :  /old/path/to_python

The first one is dynamically created and the second one already existed before.

Does anyone know how to do it?

asked Nov 25 '16 by guilhermecgs

1 Answer

I figured it out myself...

The problem is not with Spark, but with ConfigParser.

Based on this answer, I fixed ConfigParser so that it always preserves case.
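
For reference, a minimal sketch of that ConfigParser fix (Python 3 configparser; the file and section names are just placeholders for illustration):

from configparser import ConfigParser  # ConfigParser.ConfigParser on Python 2

parser = ConfigParser()
# optionxform lowercases option names by default; replacing it with str
# (the identity transform) keeps keys like PYTHONPATH in their original case.
parser.optionxform = str
parser.read('spark_settings.ini')                  # placeholder file name

for key, value in parser.items('executor_env'):    # placeholder section name
    print(key, value)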

After this, I found out that Spark's default behavior is to append the value to an existing worker environment variable if one with the same name already exists.

So it is not necessary to reference $PYTHONPATH with the dollar sign:

.setExecutorEnv('PYTHONPATH', '/custom_dir_that_I_want_to_append/')
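
Putting it together, a sketch of the full configuration with the corrected call (master URL, app name and Cassandra host copied from the question above):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("spark://192.168.10.11:7077") \
                  .setAppName("myname") \
                  .set("spark.cassandra.connection.host", "192.168.10.11") \
                  .setExecutorEnv('PYTHONPATH', '/custom_dir_that_I_want_to_append/')

# Spark appends this value to the executor's existing PYTHONPATH,
# so no '$PYTHONPATH:' prefix is needed.
sc = SparkContext(conf=conf)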
answered Nov 15 '22 by guilhermecgs