How can I monitor the progress of a job through the Spark web UI? Running Spark locally, I can access the Spark UI on port 4040 at http://localhost:4040.
Setting up PySpark in Colab
Spark is written in the Scala programming language and requires the Java Virtual Machine (JVM) to run. Therefore, our first task is to download Java. Next, we will download and unzip Apache Spark with Hadoop 2.7 to install it.
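As a rough sketch (the Spark release, download mirror, and install paths below are assumptions; substitute whichever version you are actually using), the install cell in Colab usually looks something like this:
!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
!tar xf spark-2.4.8-bin-hadoop2.7.tgz
!pip install -q findspark
import os
# Point PySpark at the Java and Spark installs above (paths are assumed)
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'
os.environ['SPARK_HOME'] = '/content/spark-2.4.8-bin-hadoop2.7'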
Apache Spark provides a suite of web user interfaces (UIs) that you can use to monitor the status and resource consumption of your Spark cluster.
Following this Colab notebook, you can do the following.
First, configure the Spark UI and start a Spark session:
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

# Run the Spark UI on port 4050 so it doesn't clash with ngrok's local API, which uses 4040
conf = SparkConf().set('spark.ui.port', '4050')
sc = SparkContext(conf=conf)

# The SparkSession reuses the context created above
spark = SparkSession.builder.master('local[*]').getOrCreate()
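If you want to double-check which port the UI actually bound to before tunneling, SparkContext exposes it; this is a standard PySpark property, shown here only as a sanity check:
# Should print something like http://<hostname>:4050
print(sc.uiWebUrl)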
In the next cell run:
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
get_ipython().system_raw('./ngrok http 4050 &')
This downloads and unzips ngrok, then starts a tunnel that exposes the Spark UI through a public URL (wait about 10 seconds for it to start).
Now, to retrieve that URL, query ngrok's local API (it listens on port 4040):
!curl -s http://localhost:4040/api/tunnels
which prints out a JSON that looks something like this (truncated):
{"tunnels":[{"name":"command_line","uri":"/api/tunnels/command_line","public_url":"https://1b881e94406c.ngrok.io","proto":"https", ... }
-- you're looking for the "public_url" field above; that's your Spark UI's URL.
Or, run this:
!curl -s http://localhost:4040/api/tunnels | python3 -c "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"
I've tested it and it works for me.
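Finally, to have something to monitor, kick off a job. This is only a toy example (the row count and column name are arbitrary); its jobs, stages, and tasks will show up in the UI's Jobs and Stages tabs:
# Run a small aggregation so the UI has jobs/stages to display
df = spark.range(0, 10_000_000)
df.selectExpr('id % 10 as bucket').groupBy('bucket').count().show()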