How to determine the size of a DataFrame?
Right now I estimate the real size of a dataframe as follows:
headers_size = sum(len(key) for key in df.first().asDict())
rows_size = df.rdd.map(lambda row: sum(len(str(value)) for value in row.asDict().values())).sum()
total_size = headers_size + rows_size
It is too slow and I'm looking for a better way.
To obtain the shape of a DataFrame in PySpark, you can get the number of rows with df.count() and the number of columns with len(df.columns).
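For example, a minimal sketch assuming a DataFrame named df already exists in the session:
num_rows = df.count()        # number of rows; triggers a Spark job
num_cols = len(df.columns)   # number of columns; reads schema metadata only, no job
print((num_rows, num_cols))  # shape as a (rows, columns) tuple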
In terms of data size, Spark has been shown to work well up to petabytes. It has been used to sort 100 TB of data 3X faster than Hadoop MapReduce on 1/10th of the machines, winning the 2014 Daytona GraySort Benchmark, as well as to sort 1 PB.
To count the number of distinct rows in PySpark, use dataframe.distinct().count(), which returns the number of unique rows in the DataFrame.
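Again as a sketch with the same assumed df:
distinct_rows = df.distinct().count()  # number of unique rows after dropping duplicates; triggers a job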
There is a nice post from Tamas Szuromi on this: http://metricbrew.com/how-to-estimate-rdd-or-dataframe-real-size-in-pyspark/
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer

def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.

    Each Python object is converted into a Java object by Pyrolite,
    whether or not the RDD is serialized in batches.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

# Convert the DataFrame's underlying RDD to a JVM object graph and estimate its size
JavaObj = _to_java_object_rdd(df.rdd)
nbytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
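Note that SizeEstimator reports the in-memory footprint of the object graph on the JVM, so the result reflects the deserialized size rather than the size on disk. A small usage sketch, reusing the df, sc, and _to_java_object_rdd names from the snippet above:
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(_to_java_object_rdd(df.rdd))
print("Estimated DataFrame size: %.2f MiB" % (size_bytes / float(1024 ** 2)))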