Pyspark: Is there an equivalent method to pandas info()?

Is there an equivalent to the pandas info() method in PySpark?

I am trying to get basic statistics about a dataframe in PySpark, such as the number of columns and rows, the number of nulls, and the size of the dataframe.

The info() method in pandas provides all of these statistics.

Brian Waters asked Jun 07 '17 18:06



3 Answers

There is also the summary method, which reports the row count and some other descriptive statistics. It is similar to the describe method already mentioned.

From the PySpark manual:

df.summary().show()
+-------+------------------+-----+
|summary|               age| name|
+-------+------------------+-----+
|  count|                 2|    2|
|   mean|               3.5| null|
| stddev|2.1213203435596424| null|
|    min|                 2|Alice|
|    25%|                 2| null|
|    50%|                 2| null|
|    75%|                 5| null|
|    max|                 5|  Bob|
+-------+------------------+-----+

or

df.select("age", "name").summary("count").show()
+-------+---+----+
|summary|age|name|
+-------+---+----+
|  count|  2|   2|
+-------+---+----+
danielfs88 answered Nov 03 '22 07:11



To get type information about a data frame, you could try df.schema:

spark.read.csv('matchCount.csv', header=True).schema

StructType(List(StructField(categ,StringType,true),StructField(minv,StringType,true),StructField(maxv,StringType,true),StructField(counts,StringType,true),StructField(cutoff,StringType,true)))

For summary statistics, you could also look at the describe method in the documentation.

StackPointer answered Nov 03 '22 07:11



The following example, adapted from another answer, counts the NaN and null values in each column.

from pyspark.sql.functions import isnan, when, count, col
import numpy as np

df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None), (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
    ('session', "timestamp1", "id2"))

df.show()
# +-------+----------+----+
# |session|timestamp1| id2|
# +-------+----------+----+
# |      1|         1|null|
# |      1|         2| 5.0|
# |      1|         3| NaN|
# |      1|         4|null|
# |      1|         5|10.0|
# |      1|         6| NaN|
# |      1|         6| NaN|
# +-------+----------+----+

df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
# +-------+----------+---+
# |session|timestamp1|id2|
# +-------+----------+---+
# |      0|         0|  3|
# +-------+----------+---+

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
# +-------+----------+---+
# |session|timestamp1|id2|
# +-------+----------+---+
# |      0|         0|  5|
# +-------+----------+---+

df.describe().show()
# +-------+-------+------------------+---+
# |summary|session|        timestamp1|id2|
# +-------+-------+------------------+---+
# |  count|      7|                 7|  5|
# |   mean|    1.0| 3.857142857142857|NaN|
# | stddev|    0.0|1.9518001458970662|NaN|
# |    min|      1|                 1|5.0|
# |    max|      1|                 6|NaN|
# +-------+-------+------------------+---+

There is no equivalent to pandas.DataFrame.info() that I know of. printSchema() is useful, and toPandas().info() works for small dataframes, but when I use pandas.DataFrame.info() I often look at the null values.
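The pieces above (row count, column types, and per-column NaN/null counts) could be combined into a rough info()-like report. This is only a minimal sketch, not a built-in PySpark feature: `format_info` and `spark_info` are hypothetical names introduced here, NaN is counted as missing only for float and double columns, and column names are assumed not to contain dots. It does not report the size of the dataframe.

```python
from typing import Dict, List, Tuple


def format_info(n_rows: int, dtypes: List[Tuple[str, str]], missing: Dict[str, int]) -> str:
    """Render an info()-style text report from already-computed statistics."""
    lines = [f"rows: {n_rows}, columns: {len(dtypes)}"]
    width = max((len(name) for name, _ in dtypes), default=0)
    for name, dtype in dtypes:
        non_missing = n_rows - missing.get(name, 0)
        lines.append(f"{name:<{width}}  {non_missing} non-missing  {dtype}")
    return "\n".join(lines)


def spark_info(df) -> str:
    """Hypothetical helper: gather the statistics from a PySpark DataFrame."""
    # Imported lazily so format_info can be used without a Spark installation.
    from pyspark.sql import functions as F

    n_rows = df.count()
    exprs = []
    for name, dtype in df.dtypes:
        is_missing = F.col(name).isNull()
        if dtype in ("float", "double"):
            # NaN only exists for floating-point columns.
            is_missing = is_missing | F.isnan(F.col(name))
        # count() skips nulls, so this counts only the rows flagged as missing.
        exprs.append(F.count(F.when(is_missing, 1)).alias(name))
    missing = df.select(exprs).first().asDict()
    return format_info(n_rows, df.dtypes, missing)
```

With a dataframe such as the one above, `print(spark_info(df))` would print one header line followed by one line per column. Unlike toPandas().info(), nothing is collected to the driver except a single row of counts, but it does trigger two Spark jobs (one for count() and one for the aggregation).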

Daniel Fernandez answered Nov 03 '22 07:11
