Is there an equivalent method to pandas info() method in PySpark?
I am trying to gain basic statistics about a dataframe in PySpark, such as: Number of columns and rows Number of nulls Size of dataframe
Info() method in pandas provides all these statistics.
Also there is summary method to get row numbers and some other descritive statistics. It is similar to describe method already mentioned.
From PySpark manual:
df.summary().show()
+-------+------------------+-----+
|summary| age| name|
+-------+------------------+-----+
| count| 2| 2|
| mean| 3.5| null|
| stddev|2.1213203435596424| null|
| min| 2|Alice|
| 25%| 2| null|
| 50%| 2| null|
| 75%| 5| null|
| max| 5| Bob|
+-------+------------------+-----+
or
df.select("age", "name").summary("count").show()
+-------+---+----+
|summary|age|name|
+-------+---+----+
| count| 2| 2|
+-------+---+----+
To figure out type information about data frame you could try df.schema
spark.read.csv('matchCount.csv',header=True).printSchema()
StructType(List(StructField(categ,StringType,true),StructField(minv,StringType,true),StructField(maxv,StringType,true),StructField(counts,StringType,true),StructField(cutoff,StringType,true)))
For Summary stats you could also have a look at describe method from the documentation.
Check this answer to get a count of the null and not null values.
from pyspark.sql.functions import isnan, when, count, col
import numpy as np
df = spark.createDataFrame(
[(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None), (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
('session', "timestamp1", "id2"))
df.show()
# +-------+----------+----+
# |session|timestamp1| id2|
# +-------+----------+----+
# | 1| 1|null|
# | 1| 2| 5.0|
# | 1| 3| NaN|
# | 2| 4|null|
# | 1| 5|10.0|
# | 1| 6| NaN|
# | 1| 6| NaN|
# +-------+----------+----+
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
# +-------+----------+---+
# |session|timestamp1|id2|
# +-------+----------+---+
# | 0| 0| 3|
# +-------+----------+---+
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
# +-------+----------+---+
# |session|timestamp1|id2|
# +-------+----------+---+
# | 0| 0| 5|
# +-------+----------+---+
df.describe().show()
# +-------+-------+------------------+---+
# |summary|session| timestamp1|id2|
# +-------+-------+------------------+---+
# | count| 7| 7| 5|
# | mean| 1.0| 3.857142857142857|NaN|
# | stddev| 0.0|1.9518001458970662|NaN|
# | min| 1| 1|5.0|
# | max| 1| 6|NaN|
# +-------+-------+------------------+---
There is no equivalent to pandas.DataFrame.info()
that I know of.
PrintSchema
is useful, and toPandas.info()
works for small dataframes but When I use pandas.DataFrame.info()
I often look at the null values.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With