 

PySpark - Show a count of column data types in a dataframe

How can I view the count of each data type in a Spark DataFrame, like I would with a pandas dataframe?

For example, assuming df is a pandas dataframe:

>>> df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
int_col      5 non-null int64
text_col     5 non-null object
float_col    5 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes

We can very clearly see the count of each data type. How can I do something similar with a Spark DataFrame? That is, how can I see how many columns are float, how many are int, and how many are objects?

Thanks!

asked by brie

2 Answers

I think the easiest way is to use collections.Counter:

df = spark.createDataFrame(
    [(1, 1.2, 'foo'), (2, 2.3, 'bar'), (None, 3.4, 'baz')],
    ["int_col", "float_col", "string_col"]
)

from collections import Counter

# df.dtypes is a list of (column name, dtype string) tuples;
# counting the dtype strings gives the totals per type
print(Counter(x[1] for x in df.dtypes))
# Counter({'double': 1, 'bigint': 1, 'string': 1})
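
If you also want to see which columns have each type, not just how many, the same df.dtypes list can be grouped by type. A minimal sketch using collections.defaultdict:

from collections import defaultdict

# group column names by their dtype string
cols_by_type = defaultdict(list)
for name, dtype in df.dtypes:
    cols_by_type[dtype].append(name)

print(dict(cols_by_type))
# {'bigint': ['int_col'], 'double': ['float_col'], 'string': ['string_col']}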

There's also the pyspark.sql.DataFrame.describe() method, which reports summary statistics, including non-null counts, for each column:

df.describe().show()
+-------+------------------+------------------+----------+
|summary|           int_col|         float_col|string_col|
+-------+------------------+------------------+----------+
|  count|                 2|                 3|         3|
|   mean|               1.5|               2.3|      null|
| stddev|0.7071067811865476|1.0999999999999999|      null|
|    min|                 1|               1.2|       bar|
|    max|                 2|               3.4|       foo|
+-------+------------------+------------------+----------+

Notice that int_col has a count of 2, since one of its values is null in this example.
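
If you want pandas-style non-null counts per column programmatically (like the "5 non-null" lines in df.info), one option is to aggregate with pyspark.sql.functions.count, which only counts non-null values. A sketch against the example df above:

from pyspark.sql import functions as F

# F.count() skips nulls, so this yields the non-null count per column
df.select([F.count(F.col(c)).alias(c) for c in df.columns]).show()
# +-------+---------+----------+
# |int_col|float_col|string_col|
# +-------+---------+----------+
# |      2|        3|         3|
# +-------+---------+----------+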

answered by pault


The code below should get you the desired result:

# create the example data frame (sqlContext here is the SQLContext/session
# provided by an older Spark shell)
df = sqlContext.createDataFrame(
    [(1,'Y','Y',0,0,0,2,'Y','N','Y','Y'),
     (2,'N','Y',2,1,2,3,'N','Y','Y','N'),
     (3,'Y','N',3,1,0,0,'N','N','N','N'),
     (4,'N','Y',5,0,1,0,'N','N','N','Y'),
     (5,'Y','N',2,2,0,1,'Y','N','N','Y'),
     (6,'Y','Y',0,0,3,6,'Y','N','Y','N'),
     (7,'N','N',1,1,3,4,'N','Y','N','Y'),
     (8,'Y','Y',1,1,2,0,'Y','Y','N','N')],
    ('id', 'compatible', 'product', 'ios', 'pc', 'other', 'devices', 'customer', 'subscriber', 'circle', 'smb')
)

# get the data types of the data frame
datatypes_List = df.dtypes

# datatypes_List holds each column and its data type as a tuple
datatypes_List
[('id', 'bigint'), ('compatible', 'string'), ('product', 'string'), ('ios', 'bigint'), ('pc', 'bigint'), ('other', 'bigint'), ('devices', 'bigint'), ('customer', 'string'), ('subscriber', 'string'), ('circle', 'string'), ('smb', 'string')]

# create an empty dictionary to store the output values
dict_count = {}

# count how many times each data type appears in the data frame
for x, y in datatypes_List:
    dict_count[y] = dict_count.get(y, 0) + 1


# query dict_count to see how many columns there are of each data type
dict_count
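# expected output for this example: {'bigint': 5, 'string': 6}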

answered by User12345