binning a dataframe in pandas in Python [duplicate]


There may be a more efficient way (I have a feeling pandas.crosstab would be useful here), but here's how I'd do it:

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100),
                       "id": np.arange(100)})

# Bin the data frame by "a" (10 evenly spaced edges, i.e. 9 bins)...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(np.digitize(df.a, bins))

# Get the mean of each bin:
print(groups.mean())  # Also could do "groups.aggregate(np.mean)"

# Similarly, the median:
print(groups.median())

# Apply some arbitrary function to aggregate binned data
print(groups.aggregate(lambda x: np.mean(x[x > 0.5])))
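As an aside, if you want to chase the pandas.crosstab idea mentioned above, a minimal sketch (reusing the df defined above) might look like this; the bin counts are arbitrary, and this is just one way crosstab could be applied here:

# Cross-tabulate counts of binned "a" against binned "b"
counts = pandas.crosstab(pandas.cut(df.a, 4), pandas.cut(df.b, 4))
print(counts)

# Or aggregate a column per cell instead of counting
print(pandas.crosstab(pandas.cut(df.a, 4), pandas.cut(df.b, 4),
                      values=df.b, aggfunc="mean"))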

Edit: Since the OP was asking specifically for the means of b binned by the values in a, just do

groups.mean().b
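If you only need the b column, selecting it before aggregating avoids computing means for the columns you don't need:

groups["b"].mean()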

Also, if you want the index to look nicer (e.g., display intervals as the index), as in @bdiamante's example, use pandas.cut instead of numpy.digitize. (Kudos to bdiamante. I didn't realize pandas.cut existed.)

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100), 
                       "b": np.random.random(100) + 10})

# Bin the data frame by "a" (10 evenly spaced edges, i.e. 9 bins)...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(pandas.cut(df.a, bins))

# Get the mean of b, binned by the values in a
print(groups.mean().b)

This results in:

a
(0.00186, 0.111]    10.421839
(0.111, 0.22]       10.427540
(0.22, 0.33]        10.538932
(0.33, 0.439]       10.445085
(0.439, 0.548]      10.313612
(0.548, 0.658]      10.319387
(0.658, 0.767]      10.367444
(0.767, 0.876]      10.469655
(0.876, 0.986]      10.571008
Name: b

Not 100% sure if this is what you're looking for, but here's what I think you're getting at:

In [143]: import numpy as np; import pandas as pd

In [144]: df = pd.DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id": np.arange(100)})

In [145]: bins = [0, .25, .5, .75, 1]

In [146]: a_bins = df.a.groupby(pd.cut(df.a, bins))

In [147]: b_bins = df.b.groupby(pd.cut(df.b, bins))

In [148]: a_bins.agg(["mean", "median"])
Out[148]:
                 mean    median
a
(0, 0.25]    0.124173  0.114613
(0.25, 0.5]  0.367703  0.358866
(0.5, 0.75]  0.624251  0.626730
(0.75, 1]    0.875395  0.869843

In [149]: b_bins.agg(["mean", "median"])
Out[149]:
                 mean    median
b
(0, 0.25]    0.147936  0.166900
(0.25, 0.5]  0.394918  0.386729
(0.5, 0.75]  0.636111  0.655247
(0.75, 1]    0.851227  0.838805

Of course, I don't know what bins you had in mind, so you'll have to swap mine out for your own.
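For instance, if your bins have domain meaning, pandas.cut accepts a labels argument; here is a minimal sketch where the edges and label names are just placeholders:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.random(100)})

# Hypothetical named bins in place of interval labels
bins = [0, .25, .5, .75, 1]
labels = ["low", "mid-low", "mid-high", "high"]
print(df.a.groupby(pd.cut(df.a, bins, labels=labels)).mean())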


Joe Kington's answer was very helpful; however, I noticed that it does not bin all of the data. It actually leaves out the row with a = a.min(), because the intervals produced by pandas.cut are open on the left, so the minimum value falls outside the first bin. Summing up groups.size() gave 99 instead of 100.

To guarantee that all data is binned, just pass the number of bins to cut() as an integer, and the function will automatically extend the range by 0.1% on each side so that the minimum and maximum values are included.

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100) + 10})

# Bin the data frame by "a" with 10 bins...
groups = df.groupby(pandas.cut(df.a, 10))

# Get the mean of b, binned by the values in a
print(groups.mean().b)

In this case, summing up groups.size() gave 100.

I know this is a picky point for this particular problem, but for a similar problem I was trying to solve, it was crucial to obtain the correct answer.
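To see the difference concretely, here's a quick check, assuming the same df and imports as above:

# Explicit edges: pandas.cut intervals are open on the left,
# so the row holding a.min() gets no bin
edges = np.linspace(df.a.min(), df.a.max(), 10)
print(df.groupby(pandas.cut(df.a, edges)).size().sum())  # 99

# Integer bin count: cut pads the range, so every row is binned
print(df.groupby(pandas.cut(df.a, 10)).size().sum())     # 100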


If you do not have to stick to pandas grouping, you could use scipy.stats.binned_statistic:

import numpy as np
from scipy.stats import binned_statistic

# statistic="mean" is the default; the result also carries the bin
# edges and the bin index assigned to each point
means, bin_edges, binnumber = binned_statistic(
    df.a, df.b, statistic="mean",
    bins=np.linspace(df.a.min(), df.a.max(), 10))
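The returned statistic array has one entry per bin, so you can pair it with the edges to get output similar to the pandas versions. Note that, unlike pandas.cut with explicit edges, these bins include their left edge, so the row with a.min() is not dropped:

# One mean per bin; bin i covers [bin_edges[i], bin_edges[i+1])
# (the last bin also includes its right edge)
for left, right, m in zip(bin_edges[:-1], bin_edges[1:], means):
    print(f"[{left:.3f}, {right:.3f}): {m:.6f}")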