In Python pandas, binning by distance is achieved with the cut() function. Suppose we want to group the values of a column called Cupcake into three groups: small, medium and big. To do so, we need to calculate the intervals that define each group.
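For instance, a minimal sketch of that kind of three-way split (the Cupcake column and its values here are made up purely for illustration):
import pandas as pd
import numpy as np

# Made-up data standing in for the Cupcake column
df = pd.DataFrame({"Cupcake": np.random.randint(0, 100, size=20)})

# cut() computes three equal-width intervals over the range and labels them
df["group"] = pd.cut(df["Cupcake"], bins=3, labels=["small", "medium", "big"])
print(df["group"].value_counts())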
Use pd.cut() for binning data based on the range of possible values. Use pd.qcut() for binning data based on the actual distribution of values.
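A quick way to see the difference (a sketch on random data; the Series and the number of bins are arbitrary):
import pandas as pd
import numpy as np

values = pd.Series(np.random.randn(1000))

# pd.cut: four bins of equal width spanning the range of values
by_width = pd.cut(values, bins=4)
# pd.qcut: four bins each holding roughly a quarter of the observations
by_quantile = pd.qcut(values, q=4)

print(by_width.value_counts().sort_index())     # counts vary from bin to bin
print(by_quantile.value_counts().sort_index())  # counts are roughly equal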
Pandas DataFrame duplicated() method: duplicated() returns a Series of True/False values indicating which rows in the DataFrame are duplicates of earlier rows. Use the subset parameter to restrict the check to particular columns, so that any columns left out are not considered when looking for duplicates.
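For example (a small sketch; the column names and rows are invented):
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Ann", "Bob"],
                   "city": ["Oslo", "Oslo", "Oslo"],
                   "id": [1, 2, 3]})

# With all columns considered, no row is an exact duplicate of an earlier one
print(df.duplicated())
# Restricting the check to name/city leaves "id" out, so the second row is flagged
print(df.duplicated(subset=["name", "city"]))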
There may be a more efficient way (I have a feeling pandas.crosstab would be useful here), but here's how I'd do it:
import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100),
                       "id": np.arange(100)})

# Bin the data frame by "a" with 10 equally spaced bin edges...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(np.digitize(df.a, bins))

# Get the mean of each bin:
print(groups.mean())  # Could also do "groups.aggregate(np.mean)"

# Similarly, the median:
print(groups.median())

# Apply some arbitrary function to aggregate the binned data
print(groups.aggregate(lambda x: np.mean(x[x > 0.5])))
Edit: As the OP was asking specifically for just the means of b binned by the values in a, just do
groups.mean().b
Also, if you want the index to look nicer (e.g. display intervals as the index), as in @bdiamante's example, use pandas.cut instead of numpy.digitize. (Kudos to @bdiamante; I didn't realize pandas.cut existed.)
import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100) + 10})

# Bin the data frame by "a" with 10 equally spaced edges (9 intervals)...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(pandas.cut(df.a, bins))

# Get the mean of b, binned by the values in a
print(groups.mean().b)
This results in:
a
(0.00186, 0.111] 10.421839
(0.111, 0.22] 10.427540
(0.22, 0.33] 10.538932
(0.33, 0.439] 10.445085
(0.439, 0.548] 10.313612
(0.548, 0.658] 10.319387
(0.658, 0.767] 10.367444
(0.767, 0.876] 10.469655
(0.876, 0.986] 10.571008
Name: b
Not 100% sure if this is what you're looking for, but here's what I think you're getting at (assuming pandas is imported as pd and numpy as np):
In [144]: df = pd.DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id": np.arange(100)})
In [145]: bins = [0, .25, .5, .75, 1]
In [146]: a_bins = df.a.groupby(pd.cut(df.a, bins))
In [147]: b_bins = df.b.groupby(pd.cut(df.b, bins))
In [148]: a_bins.agg(["mean", "median"])
Out[148]:
mean median
a
(0, 0.25] 0.124173 0.114613
(0.25, 0.5] 0.367703 0.358866
(0.5, 0.75] 0.624251 0.626730
(0.75, 1] 0.875395 0.869843
In [149]: b_bins.agg(["mean", "median"])
Out[149]:
mean median
b
(0, 0.25] 0.147936 0.166900
(0.25, 0.5] 0.394918 0.386729
(0.5, 0.75] 0.636111 0.655247
(0.75, 1] 0.851227 0.838805
Of course, I don't know what bins you had in mind, so you'll have to swap mine out to suit your circumstances.
Joe Kington's answer was very helpful; however, I noticed that it does not bin all of the data. It actually leaves out the row where a equals a.min(). Summing up groups.size() gave 99 instead of 100.
To guarantee that all of the data is binned, just pass the number of bins to cut() instead of explicit edges; cut() will then automatically extend the range by 0.1% so that the minimum and maximum values are included.
df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100) + 10})

# Bin the data frame by "a" into 10 bins...
groups = df.groupby(pandas.cut(df.a, 10))

# Get the mean of b, binned by the values in a
print(groups.mean().b)
In this case, summing up groups.size() gave 100.
I know this is a picky point for this particular problem, but for a similar problem I was trying to solve, it was crucial to obtain the correct answer.
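If you want to make that check explicit, here is a small sketch (reusing the df, np and pandas names from the snippets above; the 99 and 100 counts are the ones reported in the discussion):
# Edge-based cut leaves the row at a.min() outside the first interval...
edge_groups = df.groupby(pandas.cut(df.a, np.linspace(df.a.min(), df.a.max(), 10)))
print(edge_groups.size().sum())   # 99 of the 100 rows

# ...while an integer bin count pads the range so every row is included
count_groups = df.groupby(pandas.cut(df.a, 10))
print(count_groups.size().sum())  # all 100 rows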
If you do not have to stick to pandas grouping, you could use scipy.stats.binned_statistic:
from scipy.stats import binned_statistic

# binned_statistic computes the mean by default; .statistic holds the per-bin means of b
means = binned_statistic(df.a, df.b,
                         bins=np.linspace(df.a.min(), df.a.max(), 10)).statistic
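binned_statistic actually returns a result object rather than a plain array; a short sketch of pulling out its pieces and labelling them like the pandas output above (the statistic and bin_edges attributes are scipy's documented result fields, the formatting is just illustrative):
result = binned_statistic(df.a, df.b,
                          bins=np.linspace(df.a.min(), df.a.max(), 10))
bin_means = result.statistic   # mean of b in each of the 9 bins
bin_edges = result.bin_edges   # the 10 edges delimiting those bins
for left, right, m in zip(bin_edges[:-1], bin_edges[1:], bin_means):
    print(f"({left:.3f}, {right:.3f}]  {m:.6f}")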