This is my first question here after using StackOverflow for a long time, so please correct me if I give inaccurate or incomplete info.
Up until this week I had a Colab notebook set up to run PySpark, following one of the many guides found around the internet, but this week it started failing with a few different errors.
The code used is pretty much this one:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-preview2-bin-hadoop2.7"
import findspark
findspark.init()
I have tried changing the Java version from 8 to 11, and using every Spark build available on https://downloads.apache.org/spark/ while changing the HOME paths accordingly. Following one guide, I used pip freeze
to check the Spark version used in Colab; it said pyspark 3.0.0, so I tried all of the 3.0.0 builds, but I keep getting the error:
Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly
I don't understand much about why Java is needed for this, but I also tried installing py4j via !pip install py4j
and it says it's already installed when I do. I have tried every guide I can find on the internet, but I can't run my Spark code anymore. Does anyone know how to fix this?
I only use Colab for college work because my PC is quite outdated, and I don't know much about this setup, but I really need to get this notebook running reliably. Also, how do I know when it's time to update the builds I import?
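In case it helps with diagnosis: from what I've read, findspark raises that exception when it can't find the py4j zip under $SPARK_HOME/python/lib, so I added a quick check to see whether the download and extraction actually produced that directory (the path is just the one from my setup above):
import os
spark_home = os.environ.get("SPARK_HOME", "")
print("SPARK_HOME =", spark_home)
print("directory exists:", os.path.isdir(spark_home))
# findspark looks for the py4j zip under $SPARK_HOME/python/lib
lib_dir = os.path.join(spark_home, "python", "lib")
if os.path.isdir(lib_dir):
    print(os.listdir(lib_dir))
else:
    print("no python/lib directory - the tarball probably never downloaded or extracted")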
Following this colab notebook which worked for me:
First cell:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
and that pretty much installs pyspark.
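If you want to double-check that the install worked before going any further, a minimal sanity check is just importing it and printing the version:
import pyspark
print(pyspark.__version__)  # should print whatever version pip just installed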
But do follow these steps to also launch the Spark UI which is super-helpful for understanding physical plans, storage usage, and much more. Also: it has nice graphs ;)
Second cell:
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
# create the config, pinning the Spark UI to port 4050 so we can tunnel to it
conf = SparkConf().set("spark.ui.port", "4050")
# create the context and the session
sc = SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()
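To give the UI something to display, you can run a tiny throwaway job in the same cell; the rows below are made-up sample data, purely for illustration:
# made-up sample data so the Spark UI has at least one job/stage to show
df = spark.createDataFrame([("alice", 34), ("bob", 45), ("carol", 29)], ["name", "age"])
df.groupBy().avg("age").show()  # .show() triggers an actual Spark job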
Third cell:
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
# start ngrok in the background, tunnelling to the Spark UI port we set above
get_ipython().system_raw('./ngrok http 4050 &')
!sleep 10
# ngrok's own local API runs on port 4040; this pulls the public tunnel URL out of it
!curl -s http://localhost:4040/api/tunnels | python3 -c \
"import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"
after which you'll see a URL where you'll find the Spark UI; my example output was:
--2020-10-03 11:30:58-- https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
Resolving bin.equinox.io (bin.equinox.io)... 52.203.78.32, 52.73.16.193, 34.205.238.171, ...
Connecting to bin.equinox.io (bin.equinox.io)|52.203.78.32|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13773305 (13M) [application/octet-stream]
Saving to: ‘ngrok-stable-linux-amd64.zip.1’
ngrok-stable-linux- 100%[===================>] 13.13M 13.9MB/s in 0.9s
2020-10-03 11:31:00 (13.9 MB/s) - ‘ngrok-stable-linux-amd64.zip.1’ saved [13773305/13773305]
Archive: ngrok-stable-linux-amd64.zip
replace ngrok? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
inflating: ngrok
http://989c77d52223.ngrok.io
and that last element, http://989c77d52223.ngrok.io, was where my Spark UI lived.
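One caveat (an assumption on my part, depending on when you run this): newer ngrok releases refuse to open a tunnel without an authtoken. If ./ngrok http dies silently or the curl prints nothing, create a free ngrok account and register your token before starting the tunnel; the value below is a placeholder:
!./ngrok authtoken <YOUR_NGROK_AUTHTOKEN>  # placeholder - paste the token from your ngrok dashboard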