For a Python (pandas) DataFrame, the info() function provides memory usage. Is there an equivalent in PySpark? Thanks
The RM UI also displays the total memory per application. Spark UI: checking the Spark UI is not practical in our case. RM UI: the YARN UI seems to display the total memory consumption of the Spark application, including both the executors and the driver.
The memory_usage() method gives the memory used by each column of a pandas DataFrame. It returns a pandas Series listing the space taken up by each column in bytes. Passing deep=True to memory_usage() makes it introspect the contents of object-dtype columns as well, giving a more accurate total for the DataFrame's columns.
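For reference, a minimal pandas sketch; the DataFrame and its columns here are made up purely for illustration:

import pandas as pd

# Illustrative DataFrame; column names and contents are arbitrary.
pdf = pd.DataFrame({"id": range(1000), "label": ["x"] * 1000})

# Per-column memory in bytes; deep=True also measures the contents of
# object-dtype columns (e.g. Python strings) rather than just the pointers.
print(pdf.memory_usage(deep=True))

# Total bytes for all columns combined.
print(pdf.memory_usage(deep=True).sum())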
Similar to pandas, you can get the size and shape of a PySpark (Spark with Python) DataFrame by running the count() action to get the number of rows and len(df.columns) to get the number of columns (columns is a list attribute, not a method).
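A short sketch of that, assuming df is an existing PySpark DataFrame:

# Assumes `df` is an existing PySpark DataFrame.
num_rows = df.count()        # action: counts the rows across the cluster
num_cols = len(df.columns)   # `columns` is a plain list of column names
print((num_rows, num_cols))  # analogous to pandas' df.shape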
I have something in mind; it's just a rough estimate. As far as I know, Spark doesn't have a straightforward way to get a DataFrame's memory usage, but a pandas DataFrame does. So what you can do is sample the DataFrame, convert the sample to pandas, and inspect that:
# Take a small random sample (~1% of rows) and convert it to pandas.
sample = df.sample(fraction=0.01)
pdf = sample.toPandas()

# pandas' info() reports the sample's memory usage.
pdf.info()
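Because the sample is roughly 1% of the rows, you can scale the sampled footprint back up for a very rough estimate of the full DataFrame's size as an in-memory pandas object. A hedged sketch, where the fraction and variable names are illustrative and the linear scaling is an assumption:

fraction = 0.01  # illustrative sample fraction; must match the sample above
sample_pdf = df.sample(fraction=fraction).toPandas()

# Deep per-column usage of the sample, summed to a single byte count.
sample_bytes = sample_pdf.memory_usage(deep=True).sum()

# Rough extrapolation: assumes memory grows linearly with row count,
# and estimates the pandas footprint, not Spark's cached/serialized size.
estimated_total_bytes = sample_bytes / fraction
print(f"Estimated size: {estimated_total_bytes / 1024**2:.1f} MiB")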