I have a fairly large Parquet file which I am loading using
file = spark.read.parquet('hdfs/directory/test.parquet')
Now I want to get some statistics (similar to the pandas describe()
function). What I tried was:
file_pd = file.toPandas()
file_pd.describe()
but obviously this requires loading all the data into memory, and it fails. Can anyone suggest a workaround?
Which stats do you need? Spark has a similar feature:
file.summary().show()
+-------+----+
|summary|test|
+-------+----+
| count| 3|
| mean| 2.0|
| stddev| 1.0|
| min| 1|
| 25%| 1|
| 50%| 2|
| 75%| 3|
| max| 3|
+-------+----+
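If you only need a subset of the statistics, summary() also accepts the statistic names as arguments (available since Spark 2.3). A minimal sketch, assuming the Parquet file from the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('hdfs/directory/test.parquet')

# Request only the statistics you need; percentiles such as "25%" are also accepted
df.summary("count", "mean", "stddev", "min", "max").show()
Everything here runs on the cluster, so nothing is pulled into driver memory.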
In Spark you can use df.describe() or df.summary() to check statistical information. The difference is that df.summary() returns the same information as df.describe() plus quartile information (25%, 50% and 75%).
If you want to exclude string columns, you can use a list comprehension over df.dtypes, which returns a list of ('column_name', 'column_type') tuples, filter out the string type, and pass the remaining columns to df.select().
Command example:
df.select([col[0] for col in df.dtypes if col[1] != 'string']).describe().show()
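To tie this back to the question: the DataFrame returned by describe()/summary() is tiny (one row per statistic), so converting just that result to pandas is safe even when the full dataset would not fit in memory. A rough sketch along those lines, assuming df is the DataFrame from the question:
# Keep only non-string columns, compute the summary on the cluster,
# then bring the small result back as a pandas DataFrame
numeric_cols = [name for name, dtype in df.dtypes if dtype != 'string']
stats_pd = df.select(numeric_cols).summary().toPandas()
print(stats_pd)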