 

Running pyspark after pip install pyspark

Tags:

pip

pyspark

I wanted to install pyspark on my home machine. I did

pip install pyspark
pip install jupyter

Both seemed to work well.

But when I try to run pyspark I get

pyspark
Could not find valid SPARK_HOME while searching ['/home/user', '/home/user/.local/bin']

What should SPARK_HOME be set to?

asked Sep 18 '17 by graffe

People also ask

How do I activate PySpark?

Go to the Spark installation directory from the command line, type bin/pyspark, and press Enter; this launches the PySpark shell and gives you a prompt for interacting with Spark in Python. If Spark's bin directory is on your PATH, just enter pyspark in any terminal.
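
For example (a sketch, assuming Spark is unpacked under a hypothetical /opt/spark; substitute your actual installation directory):

$ cd /opt/spark
$ bin/pyspark     # launches the PySpark shell from the install directory

$ pyspark         # works from anywhere if Spark's bin/ is on your PATH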

Can you pip install PySpark?

For Python users, PySpark also provides pip installation from PyPI. This is usually for local use, or for acting as a client that connects to a cluster, rather than for setting up a cluster itself. PySpark can be installed with pip, with Conda, by downloading it manually, or by building it from source.
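
For instance (the conda-forge channel below is an assumption; any channel that ships pyspark works):

$ pip install pyspark                     # install from PyPI
$ conda install -c conda-forge pyspark    # or install via Conda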

How do I know if PySpark is installed?

To test whether your installation was successful, open Command Prompt, change to the SPARK_HOME directory, and type bin\pyspark. This should start the PySpark shell, which can be used to work interactively with Spark.
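
On Linux or macOS, a quick check might look like this (a sketch; assumes SPARK_HOME is already set in your shell):

$ cd "$SPARK_HOME"
$ bin/pyspark     # should print the Spark banner and a >>> prompt

$ python3 -c 'import pyspark; print(pyspark.__version__)'   # verifies a pip install
2.3.0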


1 Answer

I just faced the same issue, but it turned out that pip install pyspark downloads a Spark distribution that works well in local mode. pip just doesn't set an appropriate SPARK_HOME. But when I set it manually, pyspark works like a charm (without downloading any additional packages).

$ pip3 install --user pyspark
Collecting pyspark
  Downloading pyspark-2.3.0.tar.gz (211.9MB)
    100% |████████████████████████████████| 211.9MB 9.4kB/s 
Collecting py4j==0.10.6 (from pyspark)
  Downloading py4j-0.10.6-py2.py3-none-any.whl (189kB)
    100% |████████████████████████████████| 194kB 3.9MB/s 
Building wheels for collected packages: pyspark
  Running setup.py bdist_wheel for pyspark ... done
  Stored in directory: /home/mario/.cache/pip/wheels/4f/39/ba/b4cb0280c568ed31b63dcfa0c6275f2ffe225eeff95ba198d6
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.6 pyspark-2.3.0

$ PYSPARK_PYTHON=python3 SPARK_HOME=~/.local/lib/python3.5/site-packages/pyspark pyspark
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
2018-03-31 14:02:39 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Python version 3.5.2 (default, Nov 23 2017 16:37:01)
>>>
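
To avoid passing the variables on every launch, you can derive SPARK_HOME from the installed package and export it in your shell profile (a sketch, assuming bash and the user-level Python 3.5 install shown above; the site-packages path varies with your Python version):

$ python3 -c 'import pyspark; print(pyspark.__path__[0])'
/home/mario/.local/lib/python3.5/site-packages/pyspark

$ echo 'export SPARK_HOME=$HOME/.local/lib/python3.5/site-packages/pyspark' >> ~/.bashrc
$ echo 'export PYSPARK_PYTHON=python3' >> ~/.bashrc
$ source ~/.bashrc
$ pyspark     # now starts without setting variables inline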
answered Oct 10 '22 by Mariusz