How to find the size (in MB) of a DataFrame in PySpark
df = spark.read.json("/Filestore/tables/test.json")
I want to find the size of df (or of test.json).
Late answer, but since Google brought me here first, I figure I'll add this answer based on the comment by user @hiryu here.
This is tested and working for me. It requires caching, so it is probably best kept to notebook development.
# Need to cache the table (and force the cache to happen)
df.cache()
df.count() # force caching
# need to access hidden parameters from the `SparkSession` and `DataFrame`
catalyst_plan = df._jdf.queryExecution().logical()
size_bytes = spark._jsparkSession.sessionState().executePlan(catalyst_plan).optimizedPlan().stats().sizeInBytes()
# always try to remember to free cached data once finished
df.unpersist()
print("Total table size: ", convert_size_bytes(size_bytes))
You need to access the hidden _jdf and _jsparkSession attributes. Since the Python objects do not expose them directly, they won't be shown by IntelliSense.
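If you want this as a reusable helper, here is a minimal sketch that bundles the same steps (cache, force the count, read sizeInBytes from the optimized plan, unpersist). The function name estimate_df_size_bytes is my own, and it assumes the same private _jdf and _jsparkSession attributes are available in your Spark version:
def estimate_df_size_bytes(df, spark):
    """Estimate the materialized size of a DataFrame in bytes (sketch)."""
    df.cache()
    df.count()  # force the cache to be populated
    catalyst_plan = df._jdf.queryExecution().logical()
    size_bytes = (
        spark._jsparkSession.sessionState()
        .executePlan(catalyst_plan)
        .optimizedPlan()
        .stats()
        .sizeInBytes()
    )
    df.unpersist()
    return size_bytes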
My convert_size_bytes function looks like:
def convert_size_bytes(size_bytes):
    """
    Converts a size in bytes to a human-readable string using binary (1024-based) units.
    """
    import math
    import sys

    if not isinstance(size_bytes, int):
        size_bytes = sys.getsizeof(size_bytes)

    if size_bytes == 0:
        return "0B"

    size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
    i = int(math.floor(math.log(size_bytes, 1024)))
    p = math.pow(1024, i)
    s = round(size_bytes / p, 2)
    return "%s %s" % (s, size_name[i])
We can use explain with the 'cost' mode to get the size:
df.explain('cost')
== Optimized Logical Plan ==
Relation [value#0] text, Statistics(sizeInBytes=24.3 KiB)
You can then convert this into MB.
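For instance, a minimal sketch of the conversion (the 24.3 KiB from the example output is roughly 24883 bytes; the variable name size_bytes is just illustrative):
size_bytes = 24.3 * 1024             # plan statistics value expressed in bytes
size_mb = size_bytes / (1024 ** 2)   # binary megabytes, matching Spark's KiB/MiB reporting
print("%.4f MB" % size_mb)           # ~0.0237 MB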