Jupyter & PySpark: How to run multiple notebooks

I am using Spark 1.6.0 on three VMs: one master (standalone mode) and two workers with 8 GB RAM and 2 CPUs each.

I am using the kernel configuration below:

{
 "display_name": "PySpark",
 "language": "python3",
 "argv": [
  "/usr/bin/python3",
  "-m",
  "IPython.kernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "SPARK_HOME": "<mypath>/spark-1.6.0",
  "PYTHONSTARTUP": "<mypath>/spark-1.6.0/python/pyspark/shell.py",
  "PYSPARK_SUBMIT_ARGS": "--master spark://<mymaster>:7077 --conf spark.executor.memory=2G --driver-class-path /opt/vertica/java/lib/vertica-jdbc.jar pyspark-shell"
 }
}

Currently, this works: I can use the Spark context sc and sqlContext without importing anything, as in the pyspark shell.
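
For anyone adapting this configuration, here is a minimal sketch of building the same kernel spec from Python. The helper names build_submit_args and build_kernel_spec are hypothetical, not part of Jupyter or Spark. The sketch assumes that spark-submit treats everything after the pyspark-shell token as application arguments, so all options are placed before it. The <mypath> and <mymaster> placeholders are kept as in the question.

```python
import json


def build_submit_args(master, conf=None, driver_class_path=None):
    """Assemble a PYSPARK_SUBMIT_ARGS string that ends with 'pyspark-shell'."""
    parts = ["--master", master]
    for key, value in (conf or {}).items():
        parts += ["--conf", f"{key}={value}"]
    if driver_class_path:
        parts += ["--driver-class-path", driver_class_path]
    # Assumption: options must precede 'pyspark-shell' to be parsed by spark-submit.
    parts.append("pyspark-shell")
    return " ".join(parts)


def build_kernel_spec(spark_home, submit_args):
    """Return a kernel spec dict mirroring the configuration in the question."""
    return {
        "display_name": "PySpark",
        "language": "python3",
        "argv": ["/usr/bin/python3", "-m", "IPython.kernel", "-f", "{connection_file}"],
        "env": {
            "SPARK_HOME": spark_home,
            "PYTHONSTARTUP": spark_home + "/python/pyspark/shell.py",
            "PYSPARK_SUBMIT_ARGS": submit_args,
        },
    }


if __name__ == "__main__":
    args = build_submit_args(
        "spark://<mymaster>:7077",
        conf={"spark.executor.memory": "2G"},
        driver_class_path="/opt/vertica/java/lib/vertica-jdbc.jar",
    )
    # Print the spec; redirect the output into the kernel's kernel.json file.
    print(json.dumps(build_kernel_spec("<mypath>/spark-1.6.0", args), indent=1))
```

Generating the file this way keeps the option order in one place, which helps when several kernels with different memory settings are needed.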

The problem comes when I use multiple notebooks. On my Spark master I see two 'pyspark-shell' apps, which kind of makes sense, but only one can run at a time. Here, 'running' does not mean executing anything: even when I do not run anything in a notebook, its app is shown as 'running'. Given this, I can't share my resources between notebooks, which is quite sad (I currently have to kill the first shell, i.e. the notebook kernel, to run the second).

If you have any ideas about how to do this, tell me! Also, I'm not sure whether the way I'm working with kernels is 'best practice'; I already had trouble just getting Spark and Jupyter to work together.

Thanks, all.

asked Mar 30 '16 at 14:03 by pltrdy


1 Answer

The problem is the database Spark uses to store its metastore (Derby). Derby is a lightweight embedded database, and in this setup only one Spark instance can use it at a time. The solution is to set up another database system that can serve multiple instances (PostgreSQL, MySQL, ...).

For example, you can use PostgreSQL:

  • Add the PostgreSQL JDBC driver jar to spark/jars
  • Add a config file (hive-site.xml) to the Spark conf directory
  • Install PostgreSQL on your machine
  • Create a user, password and database for Spark/Hive in PostgreSQL (they must match the values in hive-site.xml)

Example on a linux shell:

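# download the PostgreSQL JDBC driver jar (then place it in spark/jars)
wget https://jdbc.postgresql.org/download/postgresql-42.1.4.jar

# install the PostgreSQL server on your machine
# (Debian/Ubuntu shown as an example; use your distribution's package manager)
sudo apt-get install postgresql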

# add user, pass and db to postgres
psql -d postgres -c "create user hive"
psql -d postgres -c "alter user hive with password 'pass'"
psql -d postgres -c "create database hive_metastore"
psql -d postgres -c "grant all privileges on database hive_metastore to hive"

hive-site.xml:

<configuration>

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://localhost:5432/hive_metastore</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>pass</value>
</property>

</configuration>
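
If you prefer to generate this file rather than write it by hand, here is a minimal sketch using only the Python standard library. The function name build_hive_site is hypothetical; the property names and values are the ones shown above. It is meant as an illustration of keeping the values in sync with the psql commands, not as the only way to create the file.

```python
import xml.etree.ElementTree as ET


def build_hive_site(url, driver, user, password):
    """Return hive-site.xml content (as a string) for the given JDBC settings."""
    settings = {
        "javax.jdo.option.ConnectionURL": url,
        "javax.jdo.option.ConnectionDriverName": driver,
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }
    root = ET.Element("configuration")
    for name, value in settings.items():
        # Each setting becomes one <property><name/><value/></property> element.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Same values as in the psql commands and the hive-site.xml above.
xml_text = build_hive_site(
    url="jdbc:postgresql://localhost:5432/hive_metastore",
    driver="org.postgresql.Driver",
    user="hive",
    password="pass",
)
print(xml_text)
```

Write the resulting string to hive-site.xml in the Spark conf directory. The user, password and database name must match what was created in PostgreSQL.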

answered Oct 09 '22 at 18:10 by pcc