I want to understand profiling in PySpark code. I am following this PR: https://github.com/apache/spark/pull/2351
>>> sc._conf.set("spark.python.profile", "true")
>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()
============================================================
Profile of RDD<id=1>
============================================================
284 function calls (276 primitive calls) in 0.001 seconds
Ordered by: internal time, cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
4 0.000 0.000 0.000 0.000 serializers.py:198(load_stream)
4 0.000 0.000 0.000 0.000 {reduce}
12/4 0.000 0.000 0.001 0.000 rdd.py:2092(pipeline_func)
4 0.000 0.000 0.000 0.000 {cPickle.loads}
4 0.000 0.000 0.000 0.000 {cPickle.dumps}
104 0.000 0.000 0.000 0.000 rdd.py:852(<genexpr>)
8 0.000 0.000 0.000 0.000 serializers.py:461(read_int)
12 0.000 0.000 0.000 0.000 rdd.py:303(func)
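(Side note: the same stats can also be written to disk instead of printed, which is handy for larger jobs — a minimal sketch, assuming the same session and a writable local directory; profile_dir is just a placeholder path:)

>>> import tempfile
>>> profile_dir = tempfile.mkdtemp()   # placeholder; any writable local directory works
>>> sc.dump_profiles(profile_dir)      # writes the collected profile stats into that directory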
The above works great. But if I do something like the below:
from pyspark.sql import HiveContext
from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf().setAppName("myapp").set("spark.python.profile", "true")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

df = sqlContext.sql("select * from myhivetable")
df.count()
sc.show_profiles()
this does not give me anything. I get the count, but show_profiles() gives me None.
Any help is appreciated.
There is no Python code to profile when you use Spark SQL. The only Python involved is the driver-side calls into the Scala engine; everything else is executed on the Java Virtual Machine, so the Python profiler has nothing to record.
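If you do want Python-side profiles for data coming out of the Hive table, you have to push the rows through Python yourself, for example via the RDD API. A minimal sketch, assuming the same sqlContext and table as in the question (the identity lambda is only there to force a Python pipeline):

df = sqlContext.sql("select * from myhivetable")

# Converting to an RDD and mapping with a Python lambda makes the rows flow
# through the Python workers, so the profiler has something to measure.
df.rdd.map(lambda row: row).count()

sc.show_profiles()  # should now print a profile for the mapped RDD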