Jupyter pyspark : no module named pyspark

Google is littered with solutions to this problem, but unfortunately, even after trying all of them, I am unable to get it working, so please bear with me and see if something strikes you.

OS : macOS

Spark : 1.6.3 (2.10)

Jupyter Notebook : 4.4.0

Python : 2.7

Scala : 2.12.1

I was able to successfully install and run Jupyter Notebook. Next, I tried configuring it to work with Spark, for which I installed a Spark interpreter using Apache Toree. Now when I try running any RDD operation in the notebook, the following error is thrown:

Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /private/tmp/hadoop-xxxx/nm-local-dir/usercache/xxxx/filecache/33/spark-assembly-1.6.3-hadoop2.2.0.jar

Things already tried:

1. Set PYTHONPATH in .bash_profile (roughly as sketched after this list).
2. I am able to import pyspark in the Python CLI locally.
3. I have tried updating the interpreter's kernel.json to the following:

{
  "language": "python",
  "display_name": "Apache Toree - PySpark",
  "env": {
    "__TOREE_SPARK_OPTS__": "",
    "SPARK_HOME": "/Users/xxxx/Desktop/utils/spark",
    "__TOREE_OPTS__": "",
    "DEFAULT_INTERPRETER": "PySpark",
    "PYTHONPATH": "/Users/xxxx/Desktop/utils/spark/python:/Users/xxxx/Desktop/utils/spark/python/lib/py4j-0.9-src.zip:/Users/xxxx/Desktop/utils/spark/python/lib/pyspark.zip:/Users/xxxx/Desktop/utils/spark/bin",
  "PYSPARK_SUBMIT_ARGS": "--master local --conf spark.serializer=org.apache.spark.serializer.KryoSerializer",
    "PYTHON_EXEC": "python"
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/apache_toree_pyspark/bin/run.sh",
    "--profile",
    "{connection_file}"
  ]
}
4. I have even updated the interpreter's run.sh to explicitly load the py4j-0.9-src.zip and pyspark.zip files. When opening the PySpark notebook and creating the SparkContext, I can see the spark-assembly, py4j, and pyspark packages being uploaded from local, but when an action is invoked, pyspark is still not found.
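
For reference, the .bash_profile exports from item 1 look roughly like this (a sketch only, reusing the SPARK_HOME and PYTHONPATH values from the kernel.json above; the xxxx user path is a placeholder):

# sketch of the .bash_profile entries; paths mirror the kernel.json above
export SPARK_HOME=/Users/xxxx/Desktop/utils/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$SPARK_HOME/python/lib/pyspark.zip:$PYTHONPATH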
asked Feb 03 '17 by Saurabh Mishra

2 Answers

Use the findspark library to bypass the whole environment setup process. See https://github.com/minrk/findspark for more information.
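
If findspark is not installed yet, it can be installed with pip (assuming pip points at the same Python that the notebook uses):

pip install findspark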

Use it as follows:

import findspark
findspark.init('/path_to_spark/spark-x.x.x-bin-hadoopx.x')
from pyspark.sql import SparkSession
answered Oct 21 '22 by kay


I used the following commands on Windows to link PySpark to Jupyter.

On *nix, use export instead of set (see the sketch after the commands below).

Type the code below in CMD/Command Prompt:

set PYSPARK_DRIVER_PYTHON=ipython
set PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark
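
For the *nix equivalent mentioned above, a minimal sketch (same driver settings, typed in a shell where pyspark is on the PATH):

export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark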
answered Oct 21 '22 by furianpandit