I am using Spark 1.6.0 on three VMs: one standalone master and two workers with 8 GB RAM and 2 CPUs each.
I am using the kernel configuration (kernel.json) below:
{
  "display_name": "PySpark",
  "language": "python3",
  "argv": [
    "/usr/bin/python3",
    "-m",
    "IPython.kernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "<mypath>/spark-1.6.0",
    "PYTHONSTARTUP": "<mypath>/spark-1.6.0/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master spark://<mymaster>:7077 --conf spark.executor.memory=2G --driver-class-path /opt/vertica/java/lib/vertica-jdbc.jar pyspark-shell"
  }
}
(Note: pyspark-shell has to be the last token in PYSPARK_SUBMIT_ARGS; anything after it is treated as an application argument and silently ignored.)
Currently, this works: I can use the Spark context sc and sqlContext without any import, just as in the pyspark shell.
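For example, in a fresh notebook cell (a minimal sanity check; the sample data is just illustrative):

# sc and sqlContext are created by shell.py at kernel startup (Spark 1.6 API)
print(sc.version)                           # should print 1.6.0
print(sc.parallelize(range(100)).sum())     # 4950, computed on the cluster
df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df.show()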
The problem comes when I use multiple notebooks: on my Spark master I see two 'pyspark-shell' applications, which kind of makes sense, but only one can run at a time. And here 'running' does not mean executing anything: even when I run nothing in a notebook, its application is still shown as 'running'. Because of this I can't share my resources between notebooks, which is quite sad (I currently have to kill the first shell, i.e. the first notebook kernel, to run the second).
If you have any ideas about how to do this, tell me! Also, I'm not sure that the way I'm working with kernels is best practice; I already had trouble just getting Spark and Jupyter to work together.
Thanks all.
The problem is the database Spark uses for its Hive metastore: by default it is embedded Derby. Derby is a lightweight database that accepts only a single connection, so only one Spark instance can use the metastore at a time. The solution is to back the metastore with a database system that handles multiple concurrent instances (PostgreSQL, MySQL, ...).
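In practice, while one notebook kernel is alive, the next pyspark-shell's HiveContext typically fails to initialize with a Derby lock error along the lines of:

ERROR XSDB6: Another instance of Derby may have already booted the database <path>/metastore_db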
For example, you can use a PostgreSQL database.
Example on a Linux shell:
# download the postgres JDBC jar (Spark needs it on its classpath, see below)
wget https://jdbc.postgresql.org/download/postgresql-42.1.4.jar
# install the PostgreSQL server (Debian/Ubuntu shown; use your distro's package manager)
sudo apt-get install postgresql
# create a user, password and database for the metastore
sudo -u postgres psql -d postgres -c "create user hive"
sudo -u postgres psql -d postgres -c "alter user hive with password 'pass'"
sudo -u postgres psql -d postgres -c "create database hive_metastore"
sudo -u postgres psql -d postgres -c "grant all privileges on database hive_metastore to hive"
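Before wiring Spark up, it is worth checking that the new database is reachable. A quick sketch from Python, assuming the psycopg2 driver is installed (e.g. pip install psycopg2):

# Smoke test: connect to the metastore database created above
import psycopg2
conn = psycopg2.connect(host="localhost", dbname="hive_metastore",
                        user="hive", password="pass")
print(conn.server_version)  # an integer version number if the connection works
conn.close()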
Then point Spark at it with a hive-site.xml placed in $SPARK_HOME/conf (the postgresql jar downloaded above also needs to be on the driver classpath, e.g. via --driver-class-path like the Vertica jar in the question):
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://localhost:5432/hive_metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>pass</value>
  </property>
</configuration>
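With the metastore in PostgreSQL, each notebook keeps its own pyspark-shell application but they can share the metastore concurrently. A minimal check, assuming your Spark build has Hive support so the shell's sqlContext is a HiveContext (as in a stock Spark 1.6 distribution):

# Run this in two notebooks at once; both should work now that the
# metastore is no longer the single-connection embedded Derby
sqlContext.sql("show databases").show()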