 

How to find pyspark dataframe memory usage?

For a Python (pandas) DataFrame, the info() function provides memory usage. Is there any equivalent in PySpark? Thanks

Neo asked Sep 14 '17 20:09

People also ask

How do I check my PySpark memory usage?

Checking the Spark UI is not always practical; the YARN Resource Manager (RM) UI also displays the total memory per application, i.e. the total memory consumed by the Spark app's executors and driver.

How do I check Dataframe memory usage?

The memory_usage() method gives the memory being used by each column in the DataFrame. It returns a pandas Series listing the space taken up by each column in bytes. Passing deep=True to memory_usage() gives the full memory usage of the DataFrame's columns, including the contents of object (e.g. string) columns.
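
For reference, a minimal pandas sketch (the DataFrame contents here are made up purely for illustration):

    import pandas as pd

    # Hypothetical DataFrame, purely for illustration
    pdf = pd.DataFrame({
        "id": range(1000),
        "name": ["row_%d" % i for i in range(1000)],
    })

    # Per-column memory usage in bytes; deep=True also counts the contents
    # of object (string) columns rather than just the pointer size
    print(pdf.memory_usage(deep=True))

    # Total memory usage of the DataFrame in bytes
    print(pdf.memory_usage(deep=True).sum())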

How do I check the size of a PySpark Dataframe?

Similar to pandas, you can get the size and shape of a PySpark (Spark with Python) DataFrame by running the count() action to get the number of rows and len(df.columns) to get the number of columns.
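
A quick sketch of that, assuming an existing SparkSession (the sample data is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical DataFrame, purely for illustration
    df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])

    num_rows = df.count()        # count() is an action that returns the row count
    num_cols = len(df.columns)   # df.columns is a plain list of column names
    print((num_rows, num_cols))  # pandas-style shape tuple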


1 Answer

I have something in mind; it's just a rough estimation. As far as I know, Spark doesn't have a straightforward way to get a DataFrame's memory usage, but a pandas DataFrame does. So what you can do is:

  1. Select a 1% sample: sample = df.sample(fraction = 0.01)
  2. pdf = sample.toPandas()
  3. Get the pandas DataFrame memory usage with pdf.info()
  4. Multiply that value by 100; this should give a rough estimate of your whole Spark DataFrame's memory usage (see the sketch after this list).
  5. Correct me if I am wrong :|
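
A minimal sketch of the steps above (the DataFrame, the 1% fraction, and the use of memory_usage() instead of info() are assumptions; scale the fraction to your data size):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical DataFrame standing in for your real one
    df = spark.range(100000).withColumn("value", F.rand())

    fraction = 0.01                                  # sample roughly 1% of the rows
    sample_pdf = df.sample(fraction=fraction).toPandas()

    # Total memory of the sample in bytes (deep=True includes object/string contents)
    sample_bytes = sample_pdf.memory_usage(deep=True).sum()

    # Scale back up by the inverse of the sampled fraction for a rough estimate
    estimated_bytes = sample_bytes / fraction
    print("Estimated DataFrame memory usage: %.2f MB" % (estimated_bytes / 1024 ** 2))

Keep in mind this measures the in-memory size of the pandas representation, which is not identical to Spark's internal storage footprint, so treat it as a ballpark figure.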
Vipin Chaudhary answered Sep 18 '22 14:09