Can someone please explain what the line
result = data.apply(pd.value_counts).fillna(0)
does here?
import pandas as pd
from pandas import Series, DataFrame
data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
                  'Qu2': [2, 3, 1, 2, 3],
                  'Qu3': [1, 5, 2, 4, 4]})
result = data.apply(pd.value_counts).fillna(0)
In [26]:data
Out[26]:
   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4
In [27]:result
Out[28]:
   Qu1  Qu2  Qu3
1    1    1    1
2    0    2    1
3    2    2    0
4    2    0    2
5    0    0    1
I think the easiest way to understand what's going on is to break it down.
On each column, value_counts simply counts the number of occurrences of each value in the Series (e.g. 4 appears twice in the Qu1 column):
In [11]: pd.value_counts(data.Qu1)
Out[11]:
4 2
3 2
1 1
dtype: int64
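For reference, here is a minimal sketch of the same single-column count written with the Series method instead of the top-level function. This assumes a recent pandas, where, as far as I know, pd.value_counts is deprecated in favour of Series.value_counts; the variable names counts and counts_by_value are just illustrative.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count how often each distinct value occurs in the Qu1 column.
# The index of the result holds the distinct values, the data holds the counts.
counts = data['Qu1'].value_counts()

# value_counts orders by frequency; sort by value to make it easier to read.
counts_by_value = counts.sort_index()
print(counts_by_value)
```

Note that values which never occur in the column (2 and 5 here) simply have no entry in the result.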
When you do an apply, each column's counts are realigned with the other results. Since every value between 1 and 5 appears somewhere in the data, each result is aligned with range(1, 6):
In [12]: pd.value_counts(data.Qu1).reindex(range(1, 6))
Out[12]:
1 1
2 NaN
3 2
4 2
5 NaN
dtype: float64
You want to count the values you didn't see as 0 rather than NaN, hence the fillna:
In [13]: pd.value_counts(data.Qu1).reindex(range(1, 6)).fillna(0)
Out[13]:
1 1
2 0
3 2
4 2
5 0
dtype: float64
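As a side note, the reindex and fillna steps can be folded into one call. This is only a sketch of an alternative, not what apply does internally: passing fill_value=0 to reindex fills the missing labels directly, which should also avoid the detour through NaN and float64. The name aligned is illustrative.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count the values in Qu1, then align the counts with every value from 1 to 5.
# Labels that were never seen (2 and 5) get 0 instead of NaN.
aligned = data['Qu1'].value_counts().reindex(range(1, 6), fill_value=0)
print(aligned)
```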
When you do the apply, it concatenates the result of doing this for each column:
In [14]: pd.concat((pd.value_counts(data[col]).reindex(range(1, 6)).fillna(0)
                    for col in data.columns),
                   axis=1, keys=data.columns)
Out[14]:
   Qu1  Qu2  Qu3
1    1    1    1
2    0    2    1
3    2    2    0
4    2    0    2
5    0    0    1
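To check that the step-by-step version really matches the one-liner from the question, something like the following sketch can be used. It assumes a recent pandas and therefore calls the Series method through a lambda rather than passing pd.value_counts; the names result and manual are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# The one-liner: count values per column, align the results, fill gaps with 0.
result = data.apply(lambda col: col.value_counts()).fillna(0).sort_index()

# The same thing done by hand: count, reindex and fill each column, then concat.
manual = pd.concat(
    (data[col].value_counts().reindex(range(1, 6)).fillna(0)
     for col in data.columns),
    axis=1, keys=data.columns,
).sort_index()

# Both frames should hold the same counts for the same labels.
print((result.to_numpy() == manual.to_numpy()).all())
```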
From the docs, value_counts produces a histogram of non-null values. Looking just at column Qu1 of result, we can tell that there is one 1, zero 2's, two 3's, two 4's, and zero 5's in the original column data.Qu1.