 

Describe a DataFrame in PySpark

I have a fairly large Parquet file which I am loading using:

file = spark.read.parquet('hdfs/directory/test.parquet')

Now I want to get some statistics (similar to the pandas describe() function). What I tried was:

file_pd = file.toPandas()  # collects the entire DataFrame to the driver
file_pd.describe()

but obviously this requires loading all the data into the driver's memory, so it fails. Can anyone suggest a workaround?

asked Dec 03 '22 by Tokyo


2 Answers

Which statistics do you need? Spark has a similar built-in feature:

file.summary().show()
+-------+----+
|summary|test|
+-------+----+
|  count|   3|
|   mean| 2.0|
| stddev| 1.0|
|    min|   1|
|    25%|   1|
|    50%|   2|
|    75%|   3|
|    max|   3|
+-------+----+
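If you only need a subset of these statistics, summary() also accepts statistic names as arguments, and describe() is the cheaper variant that skips the percentiles. Both run distributed across the cluster, so they avoid the toPandas() memory problem. A minimal sketch, reusing the file DataFrame loaded in the question (the particular subset of statistics here is just for illustration):

# Compute only the requested statistics
file.summary("count", "mean", "min", "max").show()

# count, mean, stddev, min, max only -- no percentile computation
file.describe().show()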
answered Dec 15 '22 by ollik1


In Spark, you can use df.describe() or df.summary() to get statistical information.

The difference is that df.summary() returns everything df.describe() does, plus the quartiles (25%, 50% and 75%).

If you want to exclude string columns, you can use a list comprehension over df.dtypes, which returns a list of ('column_name', 'column_type') tuples: keep only the non-string columns and pass them to df.select().

Command example:

df.select([col[0] for col in df.dtypes if col[1] != 'string']).describe().show()
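As a self-contained sketch of the same technique (the DataFrame and column names here are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with one string and two numeric columns
df = spark.createDataFrame(
    [("a", 1, 10.0), ("b", 2, 20.0), ("c", 3, 30.0)],
    ["name", "test", "value"],
)

# Keep only the non-string columns, then describe them
numeric_cols = [col for col, dtype in df.dtypes if dtype != 'string']
df.select(numeric_cols).describe().show()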
answered Dec 14 '22 by gustavolq