 

How to estimate a DataFrame's real size in PySpark?


How do you determine a DataFrame's size?

Right now I estimate the real size of a dataframe as follows:

headers_size = sum(len(key) for key in df.first().asDict())
rows_size = df.rdd.map(lambda row: sum(len(str(value)) for value in row.asDict().values())).sum()
total_size = headers_size + rows_size

It is too slow and I'm looking for a better way.

Asked by TheSilence on May 06 '16.

People also ask

How do you determine the size of a DataFrame in PySpark?

To obtain the shape of a DataFrame in PySpark, get the number of rows with df.count() and the number of columns with len(df.columns).
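
For example, a minimal sketch assuming an existing DataFrame named df:

num_rows = df.count()          # number of rows
num_cols = len(df.columns)     # number of columns
print((num_rows, num_cols))    # (rows, columns), analogous to a pandas .shape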

What is the size of data in Spark?

In terms of data size, Spark has been shown to work well up to petabytes. It has been used to sort 100 TB of data 3X faster than Hadoop MapReduce on 1/10th of the machines, winning the 2014 Daytona GraySort Benchmark, as well as to sort 1 PB.

How do I find the number of rows in a Spark data frame?

To count the number of distinct rows in PySpark, use dataframe.distinct().count(), which returns the number of distinct rows in the DataFrame.
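
A short sketch, again assuming a DataFrame named df:

total_rows = df.count()                 # all rows, including duplicates
distinct_rows = df.distinct().count()   # rows after duplicates are removed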


1 Answer

Nice post from Tamas Szuromi: http://metricbrew.com/how-to-estimate-rdd-or-dataframe-real-size-in-pyspark/

from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
def _to_java_object_rdd(rdd):  
    """ Return a JavaRDD of Object by unpickling
    It will convert each Python object into a Java object via Pyrolite, whether or not
    the RDD is serialized in batches.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

# Convert the DataFrame's underlying RDD to a JavaRDD of Java objects
JavaObj = _to_java_object_rdd(df.rdd)

# Estimate the in-memory size of those objects with Spark's SizeEstimator
nbytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
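
SizeEstimator returns its estimate in bytes; converting to megabytes for readability (using the nbytes variable above) might look like this:

print("Estimated DataFrame size: %.2f MB" % (nbytes / (1024.0 * 1024.0)))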
Answered by Ziggy Eunicien on Oct 01 '22.