 

SparkUI for pyspark - corresponding line of code for each stage?

I have a PySpark program running on an AWS cluster, and I am monitoring the job through the Spark UI (see attached). However, I noticed that unlike Scala or Java Spark programs, where the UI shows which line of code each Stage corresponds to, I can't tell which Stage corresponds to which line of my PySpark code.

Is there a way to figure out which Stage corresponds to which line of the PySpark code?

Thanks!

[Screenshot of the Spark UI Stages page]

asked Jul 11 '16 20:07 by Edamame

People also ask

How PySpark code is executed?

Use small scripts and multiple environments in PySpark. The normal flow is to read the data, transform it, and write it out. Often the write stage is the only place where you need to execute an action. Instead of debugging in the middle of the code, you can review the output of the whole PySpark job.

Which feature of Spark determines how your code is executed?

Spark evaluates transformations lazily: execution happens only when an action is performed on an RDD, which produces the final result. When you invoke an action, the driver builds the DAG (Directed Acyclic Graph), i.e. the execution plan (Job), for your program and submits it to the cluster.

What is the difference between PySpark and Spark?

PySpark is a Python interface for Apache Spark that lets you tame Big Data by combining the simplicity of Python with the power of Spark. Spark itself runs on top of Hadoop/HDFS and is mainly written in Scala, a functional programming language that runs on the JVM alongside Java.


1 Answer

When you run a toPandas call, the line of Python code that triggered it is shown in the SQL tab. Other actions, such as count or a parquet write, do not show the line number. I'm not sure why that is, but I find it can be very handy.

answered Sep 23 '22 13:09 by Chogg