I am using Spark 1.6.0 on three VMs: one standalone master and two workers with 8 GB RAM and 2 CPUs each.
I am using the kernel configuration (kernel.json) below:
{
  "display_name": "PySpark",
  "language": "python3",
  "argv": [
    "/usr/bin/python3",
    "-m",
    "IPython.kernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "<mypath>/spark-1.6.0",
    "PYTHONSTARTUP": "<mypath>/spark-1.6.0/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master spark://<mymaster>:7077 --conf spark.executor.memory=2G --driver-class-path /opt/vertica/java/lib/vertica-jdbc.jar pyspark-shell"
  }
}
(Note: pyspark-shell has to be the last token in PYSPARK_SUBMIT_ARGS; anything after it is treated as an application argument and silently ignored.)
Currently, this works: I can use the Spark context sc and sqlContext without any import, just as in the pyspark shell.
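For example, in a fresh notebook cell (a minimal sanity check; the sample data is just illustrative):

# sc and sqlContext are created by shell.py at kernel startup (Spark 1.6 API)
print(sc.version)                           # should print 1.6.0
print(sc.parallelize(range(100)).sum())     # 4950, computed on the cluster
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df.show()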
The problem comes when I use multiple notebooks: on my Spark master I see two 'pyspark-shell' applications, which kind of makes sense, but only one can run at a time. And here 'running' does not mean executing anything: even when I run nothing in a notebook, its application is still shown as 'running'. Because of this I can't share my resources between notebooks, which is quite sad (I currently have to kill the first shell, i.e. the first notebook kernel, to run the second).
If you have any ideas about how to do this, tell me! Also, I'm not sure that the way I'm working with kernels is best practice; I already had trouble just getting Spark and Jupyter to work together.
Thanks all.
The problem is the database Spark uses for its Hive metastore: by default it is embedded Derby. Derby is a lightweight database that accepts only a single connection, so only one Spark instance can use the metastore at a time. The solution is to back the metastore with a database system that handles multiple concurrent instances (PostgreSQL, MySQL, ...).
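In practice, while one notebook kernel is alive, the next pyspark-shell's HiveContext typically fails to initialize with a Derby lock error along the lines of:

ERROR XSDB6: Another instance of Derby may have already booted the database <path>/metastore_db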
For example, you can use a PostgreSQL database.
Example on a Linux shell:
# download the postgres JDBC jar (Spark needs it on its classpath, see below)
wget https://jdbc.postgresql.org/download/postgresql-42.1.4.jar
# install the PostgreSQL server (Debian/Ubuntu shown; use your distro's package manager)
sudo apt-get install postgresql
# create a user, password and database for the metastore
sudo -u postgres psql -d postgres -c "create user hive"
sudo -u postgres psql -d postgres -c "alter user hive with password 'pass'"
sudo -u postgres psql -d postgres -c "create database hive_metastore"
sudo -u postgres psql -d postgres -c "grant all privileges on database hive_metastore to hive"
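Before wiring Spark up, it is worth checking that the new database is reachable. A quick sketch from Python, assuming the psycopg2 driver is installed (e.g. pip install psycopg2):

# Smoke test: connect to the metastore database created above
import psycopg2
conn = psycopg2.connect(host="localhost", dbname="hive_metastore",
                        user="hive", password="pass")
print(conn.server_version)  # an integer version number if the connection works
conn.close()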
Then point Spark at it with a hive-site.xml placed in $SPARK_HOME/conf (the postgresql jar downloaded above also needs to be on the driver classpath, e.g. via --driver-class-path like the Vertica jar in the question):
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://localhost:5432/hive_metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>pass</value>
  </property>
</configuration>
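With the metastore in PostgreSQL, each notebook keeps its own pyspark-shell application but they can share the metastore concurrently. A minimal check, assuming your Spark build has Hive support so the shell's sqlContext is a HiveContext (as in a stock Spark 1.6 distribution):

# Run this in two notebooks at once; both should work now that the
# metastore is no longer the single-connection embedded Derby
sqlContext.sql("show databases").show()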