I have a pandas DataFrame:
>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame(np.random.randint(low=0, high=2,size=(5,3)),
... columns=['A', 'B', 'C'])
>>> data
   A  B  C
0  0  1  0
1  1  0  1
2  1  0  1
3  0  1  1
4  1  1  0
Now I use this to get the count of rows for column A only:
>>> data.loc[:, 'A'].value_counts()
1 3
0 2
dtype: int64
What is the most efficient way to get the count of rows for each combination of columns A and B, i.e., something like the following output:
0 0 0
0 1 2
1 0 2
1 1 1
Finally, how can I convert it into a NumPy array such as:
array([[0, 2],
       [2, 1]])
Please give a solution that also works when the DataFrame has only columns A and B:
>>> data = pd.DataFrame(np.random.randint(low=0, high=2,size=(5,2)),
... columns=['A', 'B'])
You can use groupby size and then unstack:
In [11]: data.groupby(["A","B"]).size()
Out[11]:
A  B
0  1    2
1  0    2
   1    1
dtype: int64
In [12]: data.groupby(["A","B"]).size().unstack("B")
Out[12]:
B   0  1
A
0 NaN  2
1   2  1
In [13]: data.groupby(["A","B"]).size().unstack("B").fillna(0)
Out[13]:
B  0  1
A
0  0  2
1  2  1
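The question also asks for a NumPy array, which the steps above stop short of. A minimal sketch, assuming the sample frame from the question is written out by hand (so the result is reproducible) and that the values of interest are 0 and 1; the names `levels`, `counts` and `arr` are just illustrative:

```python
import pandas as pd

# The sample frame from the question, typed out instead of generated randomly.
data = pd.DataFrame({
    "A": [0, 1, 1, 0, 1],
    "B": [1, 0, 0, 1, 1],
    "C": [0, 1, 1, 1, 0],
})

# Count rows per (A, B) pair and spread B across the columns;
# fill_value=0 fills the missing (0, 0) combination directly.
counts = data.groupby(["A", "B"]).size().unstack("B", fill_value=0)

# Assumption: 0 and 1 are the only values we care about. The reindex keeps
# the table 2x2 even if one value never occurs in A or in B.
levels = [0, 1]
counts = counts.reindex(index=levels, columns=levels, fill_value=0)

# Convert the table of counts to a plain NumPy array.
arr = counts.to_numpy()
print(arr)
```

Passing `fill_value=0` to `unstack` avoids the separate `fillna(0)` step; on older pandas versions without `to_numpy()`, `counts.values` should give the same array.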
However, whenever you do a groupby followed by an unstack, you should think of pivot_table:
In [21]: data.pivot_table(index="A", columns="B", aggfunc="count", fill_value=0)
Out[21]:
   C
B  0  1
A
0  0  2
1  2  1
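This is the most direct solution and should also be the most efficient. The extra C level in the column header appears because pivot_table counts the values in the remaining column, C.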
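The pivot_table call above counts the values in column C, so it relies on there being a column besides A and B. For the two-column frame the question mentions, one alternative (not part of the original answer) is `pd.crosstab`, which tabulates the two columns directly. A minimal sketch, with the data typed out by hand as an assumption:

```python
import pandas as pd

# Only columns A and B, as in the second frame from the question
# (values copied from the first example rather than generated randomly).
data = pd.DataFrame({
    "A": [0, 1, 1, 0, 1],
    "B": [1, 0, 0, 1, 1],
})

# crosstab counts how often each (A, B) pair occurs; pairs that never
# occur show up as 0, so no fillna step is needed.
table = pd.crosstab(data["A"], data["B"])

# Plain NumPy array of the counts.
arr = table.to_numpy()
print(table)
print(arr)
```

As with groupby, a value that never appears in A or B at all would be missing from the table, so a `reindex` over the expected values may still be needed for a fixed shape.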