 

How to get IPython built-in magic commands to work in the Jupyter Notebook PySpark kernel?

I am using the PySpark kernel installed through Apache Toree in Jupyter Notebook, with Anaconda v4.0.0 (Python 2.7.11). After getting a table from Hive, I use matplotlib/pandas to plot a graph in the notebook, following this tutorial:

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 25)

normals = pd.Series(np.random.normal(size=10))
normals.plot()

I got stuck at the very first line: trying to use %matplotlib inline shows

Name: Error parsing magics!
Message: Magics [matplotlib] do not exist!
StackTrace:

Looking at Toree Magic and MagicManager, I realised that %matplotlib is being routed to Toree's MagicManager instead of the IPython built-in magic command.

Is it possible for Apache Toree's PySpark kernel to use the IPython built-in magic commands instead?
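One way to see what the kernel actually recognises (assuming Toree's own %LsMagic built-in is available, which is distinct from IPython's %lsmagic):

# Lists the magics registered with Toree's MagicManager;
# IPython built-ins such as %matplotlib do not show up here.
%LsMagic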

asked Sep 19 '16 by Angletear


People also ask

How do I add Pyspark kernel to Jupyter Notebook?

Create a new kernel and point it at the root env in each project. To do so, create a directory 'pyspark' in /opt/wakari/wakari-compute/share/jupyter/kernels/ and place a kernel.json in it, as sketched below. You may choose any name for the 'display_name'. The configuration points to the python executable in the root environment.
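A minimal sketch of the kernel.json this describes, written out with plain Python. The interpreter path is an assumption; point it at the python executable of your own root environment:

import json
import os

# kernel directory from the answer above -- adjust to your install
kernel_dir = '/opt/wakari/wakari-compute/share/jupyter/kernels/pyspark'

spec = {
    "display_name": "PySpark (root env)",   # any name is fine
    "language": "python",
    "argv": [
        "/opt/wakari/anaconda/bin/python",  # assumed path to the root env's python
        "-m", "ipykernel",                  # launch an IPython kernel
        "-f", "{connection_file}",          # Jupyter substitutes this placeholder
    ],
}

if not os.path.isdir(kernel_dir):
    os.makedirs(kernel_dir)
with open(os.path.join(kernel_dir, 'kernel.json'), 'w') as f:
    json.dump(spec, f, indent=2)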

How do you run a command in a Jupyter Notebook?

You can run the notebook document step by step (one cell at a time) by pressing Shift+Enter. You can run the whole notebook in a single step by clicking the menu Cell -> Run All. To restart the kernel (i.e. the computational engine), click the menu Kernel -> Restart.


1 Answer

I did a workaround hack to get PySpark and magic commands working: instead of installing the Toree PySpark kernel, I run PySpark directly in Jupyter Notebook.

  1. Download and install Anaconda2 4.0.0

  2. Download Spark 1.6.0 pre-built for Hadoop 2.6

  3. Append the following lines to ~/.bashrc and run source ~/.bashrc to update the environment variables

    # added to run spark
    export PATH="{your_spark_dir}spark/sbin:$PATH"
    export PATH="{your_spark_dir}spark/bin:$PATH"

    # added to launch spark application in cluster mode
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre

    # next 2 lines are optional, needed only for a Spark cluster
    export HADOOP_CONF_DIR={your_hadoop_conf}/hadoop-conf
    export YARN_CONF_DIR={your_hadoop_conf}/hadoop-conf

    # added by Anaconda2 4.0.0 installer
    export PATH="{your_anaconda_dir}/Anaconda/bin:$PATH"

    # added to run pyspark in jupyter notebook
    export PYSPARK_DRIVER_PYTHON={your_anaconda_dir}/Anaconda/bin/jupyter
    export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='0.0.0.0' --NotebookApp.port=8888"
    export PYSPARK_PYTHON={your_anaconda_dir}/Anaconda/bin/python

Running the Jupyter Notebook

  1. Run pyspark --master=yarn --deploy-mode=client to start the notebook with PySpark running on the YARN cluster (client deploy mode)

  2. Open a browser and enter IP_ADDRESS_OF_COMPUTER:8888, then try the sanity-check cell below
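A quick sanity check for the first notebook cell: with this setup the notebook server drives a plain IPython kernel (so the built-in magics work), and the pyspark launcher predefines sc and sqlContext. The Hive table and columns below are hypothetical placeholders:

    %matplotlib inline

    import matplotlib.pyplot as plt

    # sc and sqlContext are created by the pyspark launcher
    print(sc.version)

    # hypothetical table/columns -- replace with your own Hive table
    df = sqlContext.sql("SELECT col_a, col_b FROM my_db.my_table LIMIT 1000")
    pdf = df.toPandas()             # pull the result back as a pandas DataFrame
    pdf.plot(x='col_a', y='col_b')  # renders inline thanks to %matplotlib inline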

Disclaimer
This is only a workaround, not an actual fix for the problem. Please let me know if you find a way to get IPython built-in magic commands such as %matplotlib notebook working with the Toree PySpark kernel.

answered Jan 02 '23 by Angletear