 

How do I install pyspark for use in standalone scripts?

I am trying to use Spark with Python. I installed the Spark 1.0.2 for Hadoop 2 binary distribution from the downloads page. I can run through the quickstart examples in Python interactive mode, but now I'd like to write a standalone Python script that uses Spark. The quick start documentation says to just import pyspark, but this doesn't work because it's not on my PYTHONPATH.

I can run bin/pyspark and see that the module is installed beneath SPARK_DIR/python/pyspark. I can manually add this to my PYTHONPATH environment variable, but I'd like to know the preferred automated method.

What is the best way to add pyspark support for standalone scripts? I don't see a setup.py anywhere under the Spark install directory. How would I create a pip package for a Python script that depended on Spark?

W.P. McNeill asked Aug 08 '14

People also ask

Can I install PySpark without Spark?

PySpark is a Spark library written in Python for running Python applications using Apache Spark capabilities, so there is no separate PySpark library to download; all you need is Spark. Follow the steps below to install PySpark on Windows.

How do I run PySpark scripts?

The Spark environment provides a command to execute an application file, whether it is written in Scala or Java (packaged as a jar), Python, or R. The command is: spark-submit --master <url> <SCRIPTNAME>.py
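As an illustration of the spark-submit flow above, here is a minimal standalone script sketch. The file name word_lengths.py, the helper name longest_length, and the README.md input are illustrative choices, not from the original answer; the helper is kept free of Spark so its logic can be checked on its own.

```python
# word_lengths.py -- illustrative standalone PySpark script
import os


def longest_length(lines):
    """Pure helper: length of the longest string. Needs no Spark."""
    return max((len(line) for line in lines), default=0)


def main():
    # Import inside main() so the module still loads on machines
    # where pyspark is not yet on PYTHONPATH.
    from pyspark import SparkContext

    sc = SparkContext(appName="LongestLine")
    try:
        # Distributed version of longest_length over a text file.
        print(sc.textFile("README.md").map(len).reduce(max))
    finally:
        sc.stop()


# Only attempt to start Spark when run as a script in a configured
# environment (SPARK_HOME set).
if __name__ == "__main__" and "SPARK_HOME" in os.environ:
    main()
```

Once pyspark is importable, this would be launched with spark-submit word_lengths.py rather than plain python.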


1 Answer

From Spark 2.2.0 onwards, use pip install pyspark to install PySpark on your machine.

For older versions, follow these steps. Add the PySpark library to your Python path in .bashrc:

export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH 

Also don't forget to set SPARK_HOME. PySpark depends on the py4j Python package, so install it as follows:

pip install py4j 
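An alternative to editing .bashrc is to do the equivalent inside the script itself. This is a sketch, not part of the original answer: the helper name add_pyspark_to_path is mine, and it simply prepends $SPARK_HOME/python to sys.path before pyspark is imported.

```python
import os
import sys


def add_pyspark_to_path(spark_home=None):
    """Prepend Spark's bundled Python package directory to sys.path.

    Runtime equivalent of:
        export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
    so the standalone script works without any shell setup.
    """
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError("SPARK_HOME is not set")
    python_dir = os.path.join(spark_home, "python")
    if python_dir not in sys.path:
        sys.path.insert(0, python_dir)
    return python_dir


# Typical use, at the very top of a standalone script:
#   add_pyspark_to_path()
#   import pyspark
```

Calling it twice is harmless: the directory is only inserted once.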

For more details about standalone PySpark applications, refer to this post.

prabeesh answered Oct 09 '22