How to find the size (in MB) of a DataFrame in PySpark
df = spark.read.json("/Filestore/tables/test.json")
I want to find the size of df (or of test.json).
Late answer, but since Google brought me here first, I figure I'll add this answer based on the comment by user @hiryu here.
This is tested and working for me. It requires caching, so it is probably best kept to notebook development.
# Need to cache the table (and force the cache to happen)
df.cache()
df.count() # force caching
# need to access hidden parameters from the `SparkSession` and `DataFrame`
catalyst_plan = df._jdf.queryExecution().logical()
size_bytes = spark._jsparkSession.sessionState().executePlan(catalyst_plan).optimizedPlan().stats().sizeInBytes()
# always try to remember to free cached data once finished
df.unpersist()
print("Total table size: ", convert_size_bytes(size_bytes))
You need to access the hidden _jdf and _jsparkSession attributes. Since the Python objects do not expose them directly, they won't be shown by IntelliSense.
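If you want this as a reusable helper, here is a minimal sketch that bundles the same steps (cache, force the count, read sizeInBytes from the optimized plan, unpersist). The function name estimate_df_size_bytes is my own, and it assumes the same private _jdf and _jsparkSession attributes are available in your Spark version:
def estimate_df_size_bytes(df, spark):
    """Estimate the materialized size of a DataFrame in bytes (sketch)."""
    df.cache()
    df.count()  # force the cache to be populated
    catalyst_plan = df._jdf.queryExecution().logical()
    size_bytes = (
        spark._jsparkSession.sessionState()
        .executePlan(catalyst_plan)
        .optimizedPlan()
        .stats()
        .sizeInBytes()
    )
    df.unpersist()
    return size_bytes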
My convert_size_bytes function looks like:
def convert_size_bytes(size_bytes):
    """
    Converts a size in bytes to a human-readable string using binary (1024-based) units.
    """
    import math
    import sys

    if not isinstance(size_bytes, int):
        size_bytes = sys.getsizeof(size_bytes)

    if size_bytes == 0:
        return "0B"

    size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
    i = int(math.floor(math.log(size_bytes, 1024)))
    p = math.pow(1024, i)
    s = round(size_bytes / p, 2)
    return "%s %s" % (s, size_name[i])
We can use explain with the 'cost' mode to get the size:
df.explain('cost')
== Optimized Logical Plan ==
Relation [value#0] text, Statistics(sizeInBytes=24.3 KiB)
You can then convert this into MB.
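For instance, a minimal sketch of the conversion (the 24.3 KiB from the example output is roughly 24883 bytes; the variable name size_bytes is just illustrative):
size_bytes = 24.3 * 1024             # plan statistics value expressed in bytes
size_mb = size_bytes / (1024 ** 2)   # binary megabytes, matching Spark's KiB/MiB reporting
print("%.4f MB" % size_mb)           # ~0.0237 MB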