
How to run Python 3 on Google's Dataproc PySpark

I want to run a PySpark job through Google Cloud Platform Dataproc, but I can't figure out how to set up PySpark to run Python 3 instead of the default 2.7.

The best I've been able to find is adding these initialization commands.

However, when I SSH into the cluster:
(a) the python command still runs Python 2, and
(b) my job fails due to a Python 2 incompatibility.

I've tried uninstalling Python 2 and also adding alias python='python3' in my init.sh script, but no luck: the alias doesn't stick.

I create the cluster like this:

cluster_config = {
    "projectId": self.project_id,
    "clusterName": cluster_name,
    "config": {
        "gceClusterConfig": gce_cluster_config,
        "masterConfig": master_config,
        "workerConfig": worker_config,
        # initializationActions is a flat list of action objects
        "initializationActions": [
            {
                "executableFile": executable_file_uri,
                "executionTimeout": execution_timeout,
            }
        ],
    }
}

from oauth2client.client import GoogleCredentials
from googleapiclient.discovery import build

credentials = GoogleCredentials.get_application_default()
api = build('dataproc', 'v1', credentials=credentials)

response = api.projects().regions().clusters().create(
    projectId=self.project_id,
    region=self.region, body=cluster_config
).execute()

My executable_file_uri sits on Google Storage; init.sh:

apt-get -y update
apt-get install -y python-dev
wget -O /root/get-pip.py https://bootstrap.pypa.io/get-pip.py
python /root/get-pip.py
apt-get install -y python-pip
pip install --upgrade pip
pip install --upgrade six
pip install --upgrade gcloud
pip install --upgrade requests
pip install numpy
Asked Aug 23 '17 by Roman

3 Answers

I found an answer to this here; my initialization script now looks like this:

#!/bin/bash

# Install tools
apt-get -y install python3 python-dev build-essential python3-pip
easy_install3 -U pip

# Install requirements
pip3 install --upgrade google-cloud==0.27.0
pip3 install --upgrade google-api-python-client==1.6.2
pip3 install --upgrade pytz==2013.7

# Setup python3 for Dataproc
echo "export PYSPARK_PYTHON=python3" | tee -a  /etc/profile.d/spark_config.sh  /etc/*bashrc /usr/lib/spark/conf/spark-env.sh
echo "export PYTHONHASHSEED=0" | tee -a /etc/profile.d/spark_config.sh /etc/*bashrc /usr/lib/spark/conf/spark-env.sh
echo "spark.executorEnv.PYTHONHASHSEED=0" >> /etc/spark/conf/spark-defaults.conf
Answered by Ajr

Configure the Dataproc cluster's Python environment explains this in detail. In short: on image versions before 1.4 you need an initialization action; on 1.4+ the default interpreter is already Python 3 from Miniconda3.
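
For the Python-API cluster creation shown in the question, requesting a 1.4+ image means setting softwareConfig.imageVersion in the cluster config instead of shipping a Python init action. A minimal sketch reusing the question's variables; the exact image version string is only an example:

cluster_config = {
    "projectId": project_id,
    "clusterName": cluster_name,
    "config": {
        "gceClusterConfig": gce_cluster_config,
        "masterConfig": master_config,
        "workerConfig": worker_config,
        # On 1.4+ images the default interpreter is already Python 3,
        # so no Python-related initialization action is required.
        "softwareConfig": {"imageVersion": "1.4-debian10"},
    },
}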

Answered by Dagang


You can also use the Conda init action to set up Python 3 and optionally install pip/conda packages: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/conda.

Something like:

gcloud dataproc clusters create foo \
    --initialization-actions gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,gs://dataproc-initialization-actions/conda/install-conda-env.sh
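
If you create the cluster through the Python API as in the question, the same two conda init actions can be passed as initializationActions. A small sketch; only the GCS paths come from this answer, the rest reuses the question's config:

# The two conda init actions from the gcloud command above,
# expressed for the clusters().create() request body.
conda_init_actions = [
    {"executableFile": "gs://dataproc-initialization-actions/conda/bootstrap-conda.sh"},
    {"executableFile": "gs://dataproc-initialization-actions/conda/install-conda-env.sh"},
]

cluster_config["config"]["initializationActions"] = conda_init_actions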

Answered by Karthik Palaniappan