Can someone please explain what the line
result = data.apply(pd.value_counts).fillna(0)
does here?
import pandas as pd
from pandas import Series, DataFrame
data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
                  'Qu2': [2, 3, 1, 2, 3],
                  'Qu3': [1, 5, 2, 4, 4]})
result = data.apply(pd.value_counts).fillna(0)
In [26]: data
Out[26]:
   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4
In [27]: result
Out[27]:
   Qu1  Qu2  Qu3
1    1    1    1
2    0    2    1
3    2    2    0
4    2    0    2
5    0    0    1
I think the easiest way to understand what's going on is to break it down.
On each column, value_counts simply counts the number of occurrences of each value in the Series (e.g. 4 appears twice in the Qu1 column):
In [11]: pd.value_counts(data.Qu1)
Out[11]:
4 2
3 2
1 1
dtype: int64
When you apply this to every column, each column's result is realigned with the others. Since every value from 1 to 5 appears in at least one column, each result is aligned with range(1, 6):
In [12]: pd.value_counts(data.Qu1).reindex(range(1, 6))
Out[12]:
1 1
2 NaN
3 2
4 2
5 NaN
dtype: float64
You want to count the values you didn't see as 0 rather than NaN, hence the fillna:
In [13]: pd.value_counts(data.Qu1).reindex(range(1, 6)).fillna(0)
Out[13]:
1 1
2 0
3 2
4 2
5 0
dtype: float64
When you do the apply, it concatenates the results of doing this for each column:
In [14]: pd.concat((pd.value_counts(data[col]).reindex(range(1, 6)).fillna(0)
                    for col in data.columns),
                   axis=1, keys=data.columns)
Out[14]:
   Qu1  Qu2  Qu3
1    1    1    1
2    0    2    1
3    2    2    0
4    2    0    2
5    0    0    1
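To check that the manual breakdown really matches the one-liner, here is a minimal sketch that builds both and compares them. It uses the Series method col.value_counts() instead of the top-level pd.value_counts, on the assumption that you are on a recent pandas release, where I believe the top-level function is deprecated. The names manual and same are only for illustration.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# One-liner: count the values in each column, then turn missing counts into 0.
result = data.apply(lambda col: col.value_counts()).fillna(0)

# Manual version: count each column, align it to 1..5, fill the gaps,
# then glue the per-column results together side by side.
manual = pd.concat(
    [data[col].value_counts().reindex(range(1, 6)).fillna(0)
     for col in data.columns],
    axis=1, keys=data.columns)

# Compare the raw values; sort_index guards against a different row order.
same = (result.sort_index().to_numpy() == manual.to_numpy()).all()
print(same)
```

If the two approaches agree, same is True, which is what the breakdown above predicts.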
According to the docs, value_counts produces a histogram of non-null values. Looking just at column Qu1 of result, we can tell that there is one 1, zero 2's, two 3's, two 4's, and zero 5's in the original column data.Qu1.
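The "non-null" part can be seen with a small sketch. The Series s below is made up for illustration and is not part of the question's data; dropna is a parameter of Series.value_counts.

```python
import pandas as pd

# Hypothetical example Series with one missing value.
s = pd.Series([1, 1, None])

# By default the missing value is left out of the counts.
counts = s.value_counts()

# Passing dropna=False counts the missing value as its own entry.
counts_with_nan = s.value_counts(dropna=False)

print(counts)
print(counts_with_nan)
```

So in the question's result, a NaN in the original data would not show up as a row unless you asked for it explicitly.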