I have a pandas DataFrame:
item_code price
1 15
1 30
1 60
2 50
3 90
4 110
5 130
4 150
We can see that the max price is 150. I want to divide the prices into 5 bins of 30 each (as new columns) and get the count of occurrences of each item_code in each price bin.
final df=
item_code 0-30 31-60 61-90 91-120 121-150
1 2 1 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 1
5 0 0 0 0 1
i.e. item_code 1 falls twice in the price range 0-30, so the count under column 0-30 is 2; it falls once in the range 31-60, so that count is 1, and similarly for the other item codes.
I tried using pd.cut
bins = [0, 30, 60, 90, 120,150]
df2 = pd.cut(df['price'], bins)
But it did not work.
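pd.cut on its own only labels each row with its bin; it does not count anything, so the per-item_code aggregation still has to be added, which is what the answers below do. A sketch of what the attempt above returns, given the df and bins shown:

df2 = pd.cut(df['price'], bins)   # one Interval label per row
print(df2.head())                 # (0, 30], (0, 30], (30, 60], ...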
Setup
cats = ['0-30', '31-60', '61-90', '91-120', '121-150']
bins = [0, 30, 60, 90, 120, 150]
Option 1
Use pd.get_dummies and pd.DataFrame.join:
df[['item_code']].join(pd.get_dummies(pd.cut(df.price, bins, labels=cats)))
item_code 0-30 31-60 61-90 91-120 121-150
0 1 1 0 0 0 0
1 1 1 0 0 0 0
2 1 0 1 0 0 0
3 2 0 1 0 0 0
4 3 0 0 1 0 0
5 4 0 0 0 1 0
6 5 0 0 0 0 1
7 4 0 0 0 0 1
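Option 1 counts per row; to collapse it to the per-item counts asked for, one sketch (reusing the same df, bins and cats) is to group the dummies on item_code and sum:

out = (df[['item_code']]
       .join(pd.get_dummies(pd.cut(df.price, bins, labels=cats)))
       .groupby('item_code', as_index=False)
       .sum())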
Option 2
Using numpy's searchsorted and some string array addition:
import numpy as np
from numpy.core.defchararray import add

bins = np.arange(30, 121, 30)                # inner bin edges: 30, 60, 90, 120
b = bins.astype(str)
cats = add(add(np.append('0', b), '-'), np.append(b, '150'))   # '0-30', '30-60', ..., '120-150'

df[['item_code']].join(pd.get_dummies(cats[bins.searchsorted(df.price)]))
item_code 0-30 120-150 30-60 60-90 90-120
0 1 1 0 0 0 0
1 1 1 0 0 0 0
2 1 0 0 1 0 0
3 2 0 0 1 0 0
4 3 0 0 0 1 0
5 4 0 0 0 0 1
6 5 0 1 0 0 0
7 4 0 1 0 0 0
If you are looking to sum over like-valued item_codes, you can use groupby instead of join:
import numpy as np
from numpy.core.defchararray import add

bins = np.arange(30, 121, 30)
b = bins.astype(str)
cats = add(add(np.append('0', b), '-'), np.append(b, '150'))

pd.get_dummies(cats[bins.searchsorted(df.price)]).groupby(df.item_code).sum().reset_index()
item_code 0-30 120-150 30-60 60-90 90-120
0 1 2 0 1 0 0
1 2 0 0 1 0 0
2 3 0 0 0 1 0
3 4 0 1 0 0 1
4 5 0 1 0 0 0
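Note that get_dummies emits these columns in lexicographic order ('120-150' before '30-60'). If the natural bin order matters, a reindex over the cats array built above restores it; a sketch, assuming the same df, bins and cats:

out = (pd.get_dummies(cats[bins.searchsorted(df.price)])
         .groupby(df.item_code).sum()
         .reindex(columns=cats)
         .reset_index())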
Option 3
A very fast approach using pd.factorize and np.bincount:
import numpy as np
from numpy.core.defchararray import add

bins = np.arange(30, 121, 30)
b = bins.astype(str)
cats = add(add(np.append('0', b), '-'), np.append(b, '150'))

j = bins.searchsorted(df.price)             # bin position of each price (0..4)
i, r = pd.factorize(df.item_code.values)    # item position of each row, unique item codes
m, n = r.size, cats.size                    # m unique items, n bins
pd.DataFrame(
    np.bincount(i * n + j, minlength=m * n).reshape(m, n),
    r, cats).rename_axis('item_code').reset_index()
item_code 0-30 30-60 60-90 90-120 120-150
0 1 2 1 0 0 0
1 2 0 1 0 0 0
2 3 0 0 1 0 0
3 4 0 0 0 1 1
4 5 0 0 0 0 1
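The speed of Option 3 comes from encoding each row's (item, bin) pair as one flat index so a single np.bincount call does all the counting. A standalone toy illustration of that trick, with made-up positions:

import numpy as np

rows = np.array([0, 0, 1])    # e.g. item positions
cols = np.array([2, 2, 0])    # e.g. bin positions
n_cols = 3
counts = np.bincount(rows * n_cols + cols, minlength=2 * n_cols).reshape(2, n_cols)
# counts -> [[0, 0, 2],
#            [1, 0, 0]]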
Add the parameter labels to cut, then groupby and aggregate size:
cats = ['0-30','31-60','61-90','91-120','121-150']
bins = [0, 30, 60, 90, 120,150]
df2 = (df.groupby(['item_code', pd.cut(df['price'], bins, labels=cats)])
.size()
.unstack(fill_value=0))
print (df2)
price 0-30 31-60 61-90 91-120 121-150
item_code
1 2 1 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 1
5 0 0 0 0 1
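For intuition, the intermediate groupby(...).size() result is a Series keyed by an (item_code, price bin) MultiIndex; unstack() then pivots the bin level into columns. A sketch of that intermediate step with the same df, bins and cats:

s = df.groupby(['item_code', pd.cut(df['price'], bins, labels=cats)]).size()
print(s)                        # MultiIndex Series: (item_code, bin) -> count
print(s.unstack(fill_value=0))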
EDIT: If you want a general solution, so every bin appears as a column even when it is empty, add reindex:
print (df)
item_code price
0 1 15
1 1 30
2 1 60
3 2 50
4 3 90
5 4 110
cats = ['0-30','31-60','61-90','91-120','121-150']
bins = [0, 30, 60, 90, 120,150]
df2 = (df.groupby(['item_code', pd.cut(df['price'], bins, labels=cats)])
.size()
.unstack(fill_value=0)
.reindex(columns=cats, fill_value=0))
print (df2)
price 0-30 31-60 61-90 91-120 121-150
item_code
1 2 1 0 0 0
2 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 0
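For completeness (not part of the answers above), pd.crosstab can build the same count table in one call; a sketch with the same bins and cats, keeping the reindex so empty bins still show up as zero columns:

df2 = (pd.crosstab(df['item_code'], pd.cut(df['price'], bins, labels=cats))
         .reindex(columns=cats, fill_value=0))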