 

Importing pyspark in the Python shell

This is a copy of someone else's question on another forum that was never answered, so I thought I'd re-ask it here, as I have the same issue. (See http://geekple.com/blogs/feeds/Xgzu7/posts/351703064084736)

I have Spark installed properly on my machine and can run Python programs that use the pyspark modules without error when I use ./bin/pyspark as my Python interpreter.

However, when I run the regular Python shell and try to import the pyspark modules:

from pyspark import SparkContext 

it fails with:

ImportError: No module named pyspark

How can I fix this? Is there an environment variable I need to set to point Python to the pyspark headers/libraries/etc.? If my spark installation is /spark/, which pyspark paths do I need to include? Or can pyspark programs only be run from the pyspark interpreter?

Asked Apr 23 '14 by Glenn Strycker

People also ask

How do you use Pyspark in shell?

Go to the Spark installation directory from the command line, type bin/pyspark, and press Enter; this launches the pyspark shell and gives you a prompt for interacting with Spark in Python. If Spark's bin directory is on your PATH, just enter pyspark in a command line or terminal.
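For instance, once the shell is up it predefines sc as your SparkContext, so a quick sanity check inside the shell might look like this (a minimal sketch; the sample numbers are made up):

    # `sc` is created for you when the pyspark shell starts
    rdd = sc.parallelize([1, 2, 3, 4])
    print(rdd.map(lambda x: x * 2).collect())  # -> [2, 4, 6, 8]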

How do I run Pyspark in Python?

Another PySpark-specific way to run your programs is the shell provided with PySpark itself. Again, using the Docker setup, you can connect to the container's CLI as described above. Then you can start the specialized Python shell with the command $ /usr/local/spark/bin/pyspark, which in that setup drops you into a Python 3.7 session.
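Outside the shell, a standalone script has to build its own context before it can do anything; here is a minimal sketch (the app name is a made-up placeholder, and local[*] simply uses all local cores):

    from pyspark import SparkConf, SparkContext

    # "demo-app" is an arbitrary example name; local[*] runs Spark locally
    conf = SparkConf().setAppName("demo-app").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    counts = sc.parallelize(["a", "b", "a"]).countByValue()
    print(dict(counts))  # {'a': 2, 'b': 1}
    sc.stop()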

Can you use Python in Spark shell?

Spark's shell provides a simple way to learn the API, as well as a powerful tool for analyzing data interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python.

Is Pyspark a library in Python?

PySpark is a Python API for Spark, released by the Apache Spark community to support Python with Spark. Using PySpark, you can easily integrate and work with RDDs from the Python programming language as well. Numerous features make PySpark a strong framework for working with huge datasets.


1 Answer

Assuming one of the following holds (a quick check is sketched below this list):

  • Spark is downloaded on your system and you have an environment variable SPARK_HOME pointing to it
  • You have run pip install pyspark
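A quick way to see which of the two assumptions applies on your machine (just a sketch using the standard library):

    import importlib.util
    import os

    print(os.environ.get("SPARK_HOME"))         # non-None if Spark is downloaded and exported
    print(importlib.util.find_spec("pyspark"))  # non-None if pyspark was pip-installed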

Here is a simple method (if you don't care how it works):

Use findspark

  1. Install findspark from your terminal:

        pip install findspark

  2. In your Python shell, let findspark locate Spark and add it to sys.path:

        import findspark
        findspark.init()

  3. Import the necessary modules:

        from pyspark import SparkContext
        from pyspark import SparkConf

  4. Done.
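If you would rather not depend on findspark, a rough equivalent is to extend sys.path yourself, which also answers the environment-variable part of the question. This is only a sketch: it assumes Spark is unpacked at /spark (the path from the question), and the py4j zip name varies between Spark releases:

    import glob
    import os
    import sys

    # Adjust to your install; /spark is the path mentioned in the question
    spark_home = os.environ.get("SPARK_HOME", "/spark")

    # pyspark itself lives under $SPARK_HOME/python
    sys.path.insert(0, os.path.join(spark_home, "python"))

    # py4j (the Java bridge pyspark needs) ships zipped under python/lib;
    # the version in the file name differs between Spark releases
    py4j = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
    if py4j:
        sys.path.insert(0, py4j[0])

    from pyspark import SparkContext  # should resolve now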

Answered Sep 16 '22 by Suresh2692