PySpark - Show a count of column data types in a dataframe

Question

How can I view the count of each data type in a Spark DataFrame, as I would with a pandas DataFrame?

For example, assuming df is a pandas dataframe:

>>> df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
int_col      5 non-null int64
text_col     5 non-null object
float_col    5 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes

The dtypes line clearly shows the count of each data type. How can I do something similar with a Spark DataFrame? That is, how can I see how many columns are float, how many are int, and how many are objects?

Thanks!

pault · Accepted Answer

I think the easiest way is to use collections.Counter:

df = spark.createDataFrame(
    [(1, 1.2, 'foo'), (2, 2.3, 'bar'), (None, 3.4, 'baz')],
    ["int_col", "float_col", "string_col"]
)

from collections import Counter

# df.dtypes is a list of (column_name, type_name) tuples
print(Counter(dtype for _, dtype in df.dtypes))
# Counter({'double': 1, 'bigint': 1, 'string': 1})
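To get closer to the one-line dtypes summary that pandas prints, the counts can be formatted by hand. The following is only a sketch: `dtype_summary` is a hypothetical helper name, and the literal list stands in for what `df.dtypes` returns for the example DataFrame, so that the snippet runs without a Spark session.

```python
from collections import Counter

# Assumption: this list mirrors df.dtypes for the example DataFrame above,
# i.e. a list of (column_name, type_name) tuples.
dtypes = [("int_col", "bigint"), ("float_col", "double"), ("string_col", "string")]


def dtype_summary(dtypes):
    """Build a pandas-style 'dtypes: ...' line from (name, type) pairs."""
    counts = Counter(dtype for _, dtype in dtypes)
    # Sort by type name so the output is stable and alphabetical,
    # like the pandas output shown in the question.
    parts = ["{}({})".format(name, n) for name, n in sorted(counts.items())]
    return "dtypes: " + ", ".join(parts)


summary = dtype_summary(dtypes)
print(summary)
# dtypes: bigint(1), double(1), string(1)
```

With a live DataFrame, passing `df.dtypes` instead of the literal list should give the same kind of line.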

There is also the pyspark.sql.DataFrame.describe() method, which shows summary statistics for each column:

df.describe().show()
+-------+------------------+------------------+----------+
|summary|           int_col|         float_col|string_col|
+-------+------------------+------------------+----------+
|  count|                 2|                 3|         3|
|   mean|               1.5|               2.3|      null|
| stddev|0.7071067811865476|1.0999999999999999|      null|
|    min|                 1|               1.2|       bar|
|    max|                 2|               3.4|       foo|
+-------+------------------+------------------+----------+

Notice that int_col has a count of 2, since one of its values is null in this example.

User12345 · Answer

The code below should give you the desired result:

# Create the data frame
df = sqlContext.createDataFrame(
    [
        (1, 'Y', 'Y', 0, 0, 0, 2, 'Y', 'N', 'Y', 'Y'),
        (2, 'N', 'Y', 2, 1, 2, 3, 'N', 'Y', 'Y', 'N'),
        (3, 'Y', 'N', 3, 1, 0, 0, 'N', 'N', 'N', 'N'),
        (4, 'N', 'Y', 5, 0, 1, 0, 'N', 'N', 'N', 'Y'),
        (5, 'Y', 'N', 2, 2, 0, 1, 'Y', 'N', 'N', 'Y'),
        (6, 'Y', 'Y', 0, 0, 3, 6, 'Y', 'N', 'Y', 'N'),
        (7, 'N', 'N', 1, 1, 3, 4, 'N', 'Y', 'N', 'Y'),
        (8, 'Y', 'Y', 1, 1, 2, 0, 'Y', 'Y', 'N', 'N'),
    ],
    ('id', 'compatible', 'product', 'ios', 'pc', 'other', 'devices', 'customer', 'subscriber', 'circle', 'smb')
)

# Find data types of data frame
datatypes_List = df.dtypes

# datatypes_List holds each column name and its data type as a tuple
datatypes_List
[('id', 'bigint'), ('compatible', 'string'), ('product', 'string'), ('ios', 'bigint'), ('pc', 'bigint'), ('other', 'bigint'), ('devices', 'bigint'), ('customer', 'string'), ('subscriber', 'string'), ('circle', 'string'), ('smb', 'string')]

# Create an empty dictionary to store the counts
dict_count = {}

# Count how many times each data type appears in the data frame
for x, y in datatypes_List:
    dict_count[y] = dict_count.get(y, 0) + 1

# Query dict_count to see how many columns have each data type
dict_count
{'bigint': 5, 'string': 6}
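If it also helps to know which columns fall under each type, the same loop can collect column names instead of incrementing a counter. This is a sketch, not part of the original answer: `columns_by_type` is a hypothetical name, and the list is copied from the `df.dtypes` output shown above so that the snippet runs without a Spark session.

```python
from collections import defaultdict

# Assumption: same (column_name, type_name) tuples as the df.dtypes output above.
datatypes_list = [
    ('id', 'bigint'), ('compatible', 'string'), ('product', 'string'),
    ('ios', 'bigint'), ('pc', 'bigint'), ('other', 'bigint'),
    ('devices', 'bigint'), ('customer', 'string'), ('subscriber', 'string'),
    ('circle', 'string'), ('smb', 'string'),
]

# Map each data type to the list of columns that have it.
columns_by_type = defaultdict(list)
for column, dtype in datatypes_list:
    columns_by_type[dtype].append(column)

# Print one line per type: the type name, its column count, and the columns.
for dtype, columns in sorted(columns_by_type.items()):
    print("{} ({}): {}".format(dtype, len(columns), ", ".join(columns)))
# bigint (5): id, ios, pc, other, devices
# string (6): compatible, product, customer, subscriber, circle, smb
```

The per-type counts are then just the lengths of the lists, so this structure covers both questions at once.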
