Link Spark with iPython Notebook

I have followed some online tutorials, but they do not work with Spark 1.5.1 on OS X El Capitan (10.11).

Basically, I ran these commands to install apache-spark:

```
brew update
brew install scala
brew install apache-spark
```

Then I updated my .bash_profile:

```
# For a ipython notebook and pyspark integration
if which pyspark > /dev/null; then
  export SPARK_HOME="/usr/local/Cellar/apache-spark/1.5.1/libexec/"
  export PYSPARK_SUBMIT_ARGS="--master local[2]"
fi
```
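Because the exports sit inside an `if which pyspark` guard, it can help to confirm that both variables actually reach the Python process. A minimal sketch of such a check; the helper name `missing_spark_env` is hypothetical and not part of Spark or IPython:

```python
import os

# The two variables the .bash_profile snippet is expected to export.
REQUIRED = ("SPARK_HOME", "PYSPARK_SUBMIT_ARGS")

def missing_spark_env(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]

# An empty list means both variables are visible to this process.
print(missing_spark_env())
```

Running this inside the notebook, rather than in the terminal, shows what the kernel itself sees.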

Then I ran:

```
ipython profile create pyspark
```

and created a startup file, ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py, configured this way:

```
# Configure the necessary Spark environment
import os
import sys

# Spark home
spark_home = os.environ.get("SPARK_HOME")

# If Spark V1.4.x is detected, then add ' pyspark-shell' to
# the end of the 'PYSPARK_SUBMIT_ARGS' environment variable
spark_release_file = spark_home + "/RELEASE"
if os.path.exists(spark_release_file) and "Spark 1.4" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if not "pyspark-shell" in pyspark_submit_args: pyspark_submit_args += " pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Add the spark python sub-directory to the path
sys.path.insert(0, spark_home + "/python")

# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, "python/pyspark/shell.py"))
```
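The startup file hard-codes the py4j version, which its own comment says may need changing. One way to avoid editing the number by hand is to look the archive up with a glob. This is only a sketch: `find_py4j_zip` is a hypothetical helper, and it assumes the archive sits in `python/lib` under the Spark home, as in the path above.

```python
import glob
import os

def find_py4j_zip(spark_home):
    """Return the first py4j source zip under python/lib, or None if absent."""
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None
```

In the startup file this could replace the fixed path, for example `sys.path.insert(0, find_py4j_zip(spark_home))`, after checking that the result is not None.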

I then run ipython notebook --profile=pyspark and the notebook works fine, but the sc (spark context) is not recognised.
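One possible cause, offered as an assumption rather than a confirmed diagnosis: the startup file appends `pyspark-shell` only when the RELEASE file mentions "Spark 1.4", so on 1.5.1 that branch is skipped. A minimal sketch of a check that also covers later releases follows. The helper names `spark_version` and `with_pyspark_shell` are hypothetical, and the RELEASE file is assumed to contain text such as "Spark 1.5.1 built for Hadoop 2.6.0".

```python
import re

def spark_version(release_text):
    """Return (major, minor) parsed from RELEASE file text, or None."""
    match = re.search(r"Spark (\d+)\.(\d+)", release_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def with_pyspark_shell(submit_args, release_text):
    """Append ' pyspark-shell' for Spark 1.4 or newer if it is missing."""
    version = spark_version(release_text)
    if version is not None and version >= (1, 4) and "pyspark-shell" not in submit_args:
        return submit_args + " pyspark-shell"
    return submit_args

# prints: --master local[2] pyspark-shell
print(with_pyspark_shell("--master local[2]", "Spark 1.5.1 built for Hadoop 2.6.0"))
```

The returned string would then be written back to `os.environ["PYSPARK_SUBMIT_ARGS"]` before shell.py is executed.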

Has anyone managed to do this with Spark 1.5.1?

EDIT: you can follow this guide to get it working:

https://gist.github.com/tommycarpi/f5a67c66a8f2170e263c

924

asked Oct 11 '15 10:10

r4id4

1 Answer

I have Jupyter installed, and it is indeed simpler than you think:

Install Anaconda for OS X.
Install Jupyter by typing the following line in your terminal:
```
ilovejobs@mymac:~$ conda install jupyter
```

Update Jupyter, just in case.
```
ilovejobs@mymac:~$ conda update jupyter
```

Download Apache Spark and compile it, or download and uncompress Apache Spark 1.5.1 + Hadoop 2.6.
```
ilovejobs@mymac:~$ cd Downloads
ilovejobs@mymac:~/Downloads$ wget http://www.apache.org/dyn/closer.lua/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz
```

Create an Apps folder in your home directory (for example):
```
ilovejobs@mymac:~/Downloads$ mkdir ~/Apps
```

Move the uncompressed folder spark-1.5.1 to the ~/Apps directory.
```
ilovejobs@mymac:~/Downloads$ mv spark-1.5.1/ ~/Apps
```

Move to the ~/Apps directory and verify that Spark is there.
```
ilovejobs@mymac:~/Downloads$ cd ~/Apps
ilovejobs@mymac:~/Apps$ ls -l
drwxr-xr-x ?? ilovejobs ilovejobs 4096 ?? ?? ??:?? spark-1.5.1
```

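Here is the first tricky part. Add the Spark binaries to your $PATH (the single quotes keep $HOME and $PATH from being expanded before the line is written to the file):
```
ilovejobs@mymac:~/Apps$ cd
ilovejobs@mymac:~$ echo 'export PATH="$HOME/Apps/spark-1.5.1/bin:$PATH"' >> .profile
```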

Here is the second tricky part. Add these environment variables as well:
```
ilovejobs@mymac:~$ echo "export PYSPARK_DRIVER_PYTHON=ipython" >> .profile
ilovejobs@mymac:~$ echo "export PYSPARK_DRIVER_PYTHON_OPTS='notebook'" >> .profile
```

Source the profile to make these variables available in this terminal:
```
ilovejobs@mymac:~$ source .profile
```
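The same settings can also be applied for a single launch instead of being written to .profile. A minimal sketch, assuming Spark was moved to ~/Apps/spark-1.5.1 as in the steps above; `notebook_env` is a hypothetical helper, not part of Spark:

```python
import os

def notebook_env(spark_home, base_env=None):
    """Return a copy of the environment with the notebook driver settings."""
    env = dict(os.environ if base_env is None else base_env)
    # Put Spark's bin directory first so its pyspark script is found.
    env["PATH"] = os.path.join(spark_home, "bin") + os.pathsep + env.get("PATH", "")
    env["PYSPARK_DRIVER_PYTHON"] = "ipython"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    return env

# Hypothetical usage, assuming Spark lives in ~/Apps/spark-1.5.1:
#   import subprocess
#   home = os.path.expanduser("~/Apps/spark-1.5.1")
#   subprocess.call([os.path.join(home, "bin", "pyspark")], env=notebook_env(home))
```

Working on a copy leaves the calling process's own environment untouched, which makes it easy to try the settings without editing any profile file.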
Create a ~/notebooks directory.
```
ilovejobs@mymac:~$ mkdir notebooks
```

Move to ~/notebooks and run pyspark:
```
ilovejobs@mymac:~$ cd notebooks
ilovejobs@mymac:~/notebooks$ pyspark
```

Note that you can also add those variables to the .bashrc in your home directory. Now be happy: you should be able to run Jupyter with a PySpark kernel (it will show up as Python 2, but it will use Spark).
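To confirm that the notebook really has a working Spark context, a small job can be run in the first cell. This is a sketch that assumes `sc` has been predefined by pyspark; `spark_smoke_test` is a hypothetical helper name.

```python
def spark_smoke_test(sc):
    """Sum the integers 0..99 through Spark; the expected result is 4950."""
    return sc.parallelize(range(100)).sum()
```

Calling `spark_smoke_test(sc)` in a cell should return 4950. A NameError on `sc` suggests that the notebook was not started through pyspark.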

121

answered Oct 18 '22 09:10

Alberto Bonsanto

Tags:

ipython

osx-elcapitan

ipython-notebook

apache-spark

pyspark
