I have a fairly large Parquet file which I am loading using
file = spark.read.parquet('hdfs/directory/test.parquet')
Now I want to get some statistics (similar to the pandas describe()
function). What I tried was:
file_pd = file.toPandas()
file_pd.describe()
but obviously this requires loading all the data into memory, and it fails. Can anyone suggest a workaround?
Which stats do you need? Spark has a similar feature:
file.summary().show()
+-------+----+
|summary|test|
+-------+----+
| count| 3|
| mean| 2.0|
| stddev| 1.0|
| min| 1|
| 25%| 1|
| 50%| 2|
| 75%| 3|
| max| 3|
+-------+----+
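If you only need a subset of the statistics, summary() also accepts the statistic names as arguments (available since Spark 2.3). A minimal sketch, assuming the Parquet file from the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('hdfs/directory/test.parquet')

# Request only the statistics you need; percentiles such as "25%" are also accepted
df.summary("count", "mean", "stddev", "min", "max").show()
Everything here runs on the cluster, so nothing is pulled into driver memory.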
In Spark you can use df.describe() or df.summary() to check statistical information. The difference is that df.summary() returns the same information as df.describe() plus quartile information (25%, 50% and 75%).
If you want to exclude string columns, you can use a list comprehension over df.dtypes, which returns a list of ('column_name', 'column_type') tuples, filter out the string type, and pass the remaining columns to df.select().
Command example:
df.select([col[0] for col in df.dtypes if col[1] != 'string']).describe().show()
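To tie this back to the question: the DataFrame returned by describe()/summary() is tiny (one row per statistic), so converting just that result to pandas is safe even when the full dataset would not fit in memory. A rough sketch along those lines, assuming df is the DataFrame from the question:
# Keep only non-string columns, compute the summary on the cluster,
# then bring the small result back as a pandas DataFrame
numeric_cols = [name for name, dtype in df.dtypes if dtype != 'string']
stats_pd = df.select(numeric_cols).summary().toPandas()
print(stats_pd)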