How to profile PySpark jobs

I want to understand profiling of PySpark code.

Following this: https://github.com/apache/spark/pull/2351

>>> sc._conf.set("spark.python.profile", "true")
>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()
============================================================
Profile of RDD<id=1>
============================================================
         284 function calls (276 primitive calls) in 0.001 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
        4    0.000    0.000    0.000    0.000 {reduce}
     12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
        4    0.000    0.000    0.000    0.000 {cPickle.loads}
        4    0.000    0.000    0.000    0.000 {cPickle.dumps}
      104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
        8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
       12    0.000    0.000    0.000    0.000 rdd.py:303(func)

The above works great. But if I do something like the following:

from pyspark.sql import HiveContext
from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf().setAppName("myapp").set("spark.python.profile", "true")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

df = sqlContext.sql("select * from myhivetable")
df.count()
sc.show_profiles()

This does not give me anything. I get the count, but show_profiles() prints nothing and just returns None.

Any help is appreciated.

asked Aug 31 '16 by sau


1 Answer

There is no Python code to profile when you use Spark SQL. The only Python involved is the thin layer that calls into the Scala engine; everything else executes on the Java Virtual Machine, so the Python profiler has nothing to record.
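One way to see the profiler kick in again, as a minimal sketch building on the snippet from the question: force the rows back through the Python workers, for example by dropping from the DataFrame to the RDD API and applying a Python function. The lambda below is a deliberately trivial placeholder; any Python function would do.

# The lambda runs inside the Python worker processes, so the profiler
# now has Python code to measure, unlike the pure SQL count above.
df = sqlContext.sql("select * from myhivetable")
df.rdd.map(lambda row: row).count()
sc.show_profiles()

This time the count goes through a pipelined RDD whose map executes in Python, so show_profiles() prints a cProfile table like the one in the question.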

answered Oct 17 '22 by user6022341