I have a pandas DataFrame:
item_code price
1 15
1 30
1 60
2 50
3 90
4 110
5 130
4 150
We can see that the max price is 150. I want to divide the prices into 5 bins of 30 each (as new columns) and get the count of occurrences of each item_code in each price bin.
final df=
item_code 0-30 31-60 61-90 91-120 121-150
1 2 1 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 1
5 0 0 0 0 1
i.e. item_code 1 falls twice in the price range 0-30, so the count under column 0-30 is 2; it falls once in the range 31-60, so that count is 1, and similarly for the other item codes.
I tried using pd.cut
bins = [0, 30, 60, 90, 120,150]
df2 = pd.cut(df['price'], bins)
But it did not work.
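pd.cut on its own only labels each row with its bin; it does not count anything, so the per-item_code aggregation still has to be added, which is what the answers below do. A sketch of what the attempt above returns, given the df and bins shown:

df2 = pd.cut(df['price'], bins)   # one Interval label per row
print(df2.head())                 # (0, 30], (0, 30], (30, 60], ...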
Setup
cats = ['0-30', '31-60', '61-90', '91-120', '121-150']
bins = [0, 30, 60, 90, 120, 150]
Option 1
Use pd.get_dummies and pd.DataFrame.join:
df[['item_code']].join(pd.get_dummies(pd.cut(df.price, bins, labels=cats)))
item_code 0-30 31-60 61-90 91-120 121-150
0 1 1 0 0 0 0
1 1 1 0 0 0 0
2 1 0 1 0 0 0
3 2 0 1 0 0 0
4 3 0 0 1 0 0
5 4 0 0 0 1 0
6 5 0 0 0 0 1
7 4 0 0 0 0 1
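Option 1 counts per row; to collapse it to the per-item counts asked for, one sketch (reusing the same df, bins and cats) is to group the dummies on item_code and sum:

out = (df[['item_code']]
       .join(pd.get_dummies(pd.cut(df.price, bins, labels=cats)))
       .groupby('item_code', as_index=False)
       .sum())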
Option 2
Using numpy's searchsorted and some string array addition:
import numpy as np
from numpy.core.defchararray import add

bins = np.arange(30, 121, 30)                # inner bin edges: 30, 60, 90, 120
b = bins.astype(str)
cats = add(add(np.append('0', b), '-'), np.append(b, '150'))   # '0-30', '30-60', ..., '120-150'

df[['item_code']].join(pd.get_dummies(cats[bins.searchsorted(df.price)]))
item_code 0-30 120-150 30-60 60-90 90-120
0 1 1 0 0 0 0
1 1 1 0 0 0 0
2 1 0 0 1 0 0
3 2 0 0 1 0 0
4 3 0 0 0 1 0
5 4 0 0 0 0 1
6 5 0 1 0 0 0
7 4 0 1 0 0 0
If you are looking to sum over like-valued item_codes, you can use groupby instead of join:
import numpy as np
from numpy.core.defchararray import add

bins = np.arange(30, 121, 30)
b = bins.astype(str)
cats = add(add(np.append('0', b), '-'), np.append(b, '150'))

pd.get_dummies(cats[bins.searchsorted(df.price)]).groupby(df.item_code).sum().reset_index()
item_code 0-30 120-150 30-60 60-90 90-120
0 1 2 0 1 0 0
1 2 0 0 1 0 0
2 3 0 0 0 1 0
3 4 0 1 0 0 1
4 5 0 1 0 0 0
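Note that get_dummies emits these columns in lexicographic order ('120-150' before '30-60'). If the natural bin order matters, a reindex over the cats array built above restores it; a sketch, assuming the same df, bins and cats:

out = (pd.get_dummies(cats[bins.searchsorted(df.price)])
         .groupby(df.item_code).sum()
         .reindex(columns=cats)
         .reset_index())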
Option 3
A very fast approach using pd.factorize and np.bincount:
import numpy as np
from numpy.core.defchararray import add

bins = np.arange(30, 121, 30)
b = bins.astype(str)
cats = add(add(np.append('0', b), '-'), np.append(b, '150'))

j = bins.searchsorted(df.price)             # bin position of each price (0..4)
i, r = pd.factorize(df.item_code.values)    # item position of each row, unique item codes
m, n = r.size, cats.size                    # m unique items, n bins
pd.DataFrame(
    np.bincount(i * n + j, minlength=m * n).reshape(m, n),
    r, cats).rename_axis('item_code').reset_index()
item_code 0-30 30-60 60-90 90-120 120-150
0 1 2 1 0 0 0
1 2 0 1 0 0 0
2 3 0 0 1 0 0
3 4 0 0 0 1 1
4 5 0 0 0 0 1
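The speed of Option 3 comes from encoding each row's (item, bin) pair as one flat index so a single np.bincount call does all the counting. A standalone toy illustration of that trick, with made-up positions:

import numpy as np

rows = np.array([0, 0, 1])    # e.g. item positions
cols = np.array([2, 2, 0])    # e.g. bin positions
n_cols = 3
counts = np.bincount(rows * n_cols + cols, minlength=2 * n_cols).reshape(2, n_cols)
# counts -> [[0, 0, 2],
#            [1, 0, 0]]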
Add the parameter labels to cut, then groupby and aggregate size:
cats = ['0-30','31-60','61-90','91-120','121-150']
bins = [0, 30, 60, 90, 120,150]
df2 = (df.groupby(['item_code', pd.cut(df['price'], bins, labels=cats)])
.size()
.unstack(fill_value=0))
print (df2)
price 0-30 31-60 61-90 91-120 121-150
item_code
1 2 1 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 1
5 0 0 0 0 1
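For intuition, the intermediate groupby(...).size() result is a Series keyed by an (item_code, price bin) MultiIndex; unstack() then pivots the bin level into columns. A sketch of that intermediate step with the same df, bins and cats:

s = df.groupby(['item_code', pd.cut(df['price'], bins, labels=cats)]).size()
print(s)                        # MultiIndex Series: (item_code, bin) -> count
print(s.unstack(fill_value=0))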
EDIT: If you want a general solution, so every bin appears as a column even when it is empty, add reindex:
print (df)
item_code price
0 1 15
1 1 30
2 1 60
3 2 50
4 3 90
5 4 110
cats = ['0-30','31-60','61-90','91-120','121-150']
bins = [0, 30, 60, 90, 120,150]
df2 = (df.groupby(['item_code', pd.cut(df['price'], bins, labels=cats)])
.size()
.unstack(fill_value=0)
.reindex(columns=cats, fill_value=0))
print (df2)
price 0-30 31-60 61-90 91-120 121-150
item_code
1 2 1 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 0
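For completeness (not part of the answers above), pd.crosstab can build the same count table in one call; a sketch with the same bins and cats, keeping the reindex so empty bins still show up as zero columns:

df2 = (pd.crosstab(df['item_code'], pd.cut(df['price'], bins, labels=cats))
         .reindex(columns=cats, fill_value=0))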