
Using pyspark in Google Colab

This is my first question here after using StackOverflow for a long time, so correct me if I give inaccurate or incomplete info.

Up until this week I had a Colab notebook set up to run PySpark, following one of the many guides found around the internet, but this week it started failing with a few different errors.

The code used is pretty much this one:

# install Java 8, fetch and unpack Spark, install findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop2.7.tgz
!pip install -q findspark

# point the environment at the Java and Spark installs, then let findspark wire up sys.path
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-preview2-bin-hadoop2.7"
import findspark
findspark.init()

I have tried changing the Java version from 8 to 11, and using every available Spark build on https://downloads.apache.org/spark/ while changing the HOME paths accordingly. As one guide suggested, I used pip freeze to check the Spark version Colab uses; it reported pyspark 3.0.0, so I tried all of the 3.0.0 builds, but I keep getting the error:

Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly

I don't understand much about why Java is needed for this. I also tried installing py4j through !pip install py4j, but it says it is already installed. I have tried every guide I could find, yet I can't run my Spark code anymore. Does anyone know how to fix this? I only use Colab for college work because my PC is quite outdated, but I really need this notebook running reliably. Also, how do I know when it's time to update the builds I import?

asked Aug 09 '20 by Victor Régis

1 Answer

I followed this Colab notebook, which worked for me:

First cell:

!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
# point Spark at the Java 8 install
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

and that pretty much installs pyspark.

But do follow these steps to also launch the Spark UI, which is super-helpful for understanding physical plans, storage usage, and much more. Also: it has nice graphs ;)

Second cell:

from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

# pin the UI to a known port so ngrok can tunnel it
conf = SparkConf().set("spark.ui.port", "4050")

# create the context and the session
sc = SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

Third cell:

!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
get_ipython().system_raw('./ngrok http 4050 &')
!sleep 10
!curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

after which you'll see a URL where you'll find the Spark UI; my example output was:

--2020-10-03 11:30:58--  https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
Resolving bin.equinox.io (bin.equinox.io)... 52.203.78.32, 52.73.16.193, 34.205.238.171, ...
Connecting to bin.equinox.io (bin.equinox.io)|52.203.78.32|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13773305 (13M) [application/octet-stream]
Saving to: ‘ngrok-stable-linux-amd64.zip.1’

ngrok-stable-linux- 100%[===================>]  13.13M  13.9MB/s    in 0.9s    

2020-10-03 11:31:00 (13.9 MB/s) - ‘ngrok-stable-linux-amd64.zip.1’ saved [13773305/13773305]

Archive:  ngrok-stable-linux-amd64.zip
replace ngrok? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ngrok                   
http://989c77d52223.ngrok.io

and that last element, http://989c77d52223.ngrok.io, was where my Spark UI lived.
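The curl one-liner in the third cell just pulls public_url out of the JSON that ngrok's local API serves on port 4040. The same extraction as a small helper, in case you prefer doing it from Python (the function name is mine):

```python
import json

def first_public_url(tunnels_response):
    """Given the JSON body of ngrok's http://localhost:4040/api/tunnels,
    return the first tunnel's public URL."""
    return json.loads(tunnels_response)["tunnels"][0]["public_url"]
```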

answered Sep 23 '22 by ponadto