binning a dataframe in pandas in Python [duplicate]


There may be a more efficient way (I have a feeling pandas.crosstab would be useful here), but here's how I'd do it:

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100),
                       "id": np.arange(100)})

# Bin the data frame by "a" (10 evenly spaced edges, i.e. 9 bins)...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(np.digitize(df.a, bins))

# Get the mean of each bin:
print(groups.mean())  # Also could do "groups.aggregate(np.mean)"

# Similarly, the median:
print(groups.median())

# Apply some arbitrary function to aggregate binned data
print(groups.aggregate(lambda x: np.mean(x[x > 0.5])))
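As an aside, if you want to chase the pandas.crosstab idea mentioned above, a minimal sketch (reusing the df defined above) might look like this; the bin counts are arbitrary, and this is just one way crosstab could be applied here:

# Cross-tabulate counts of binned "a" against binned "b"
counts = pandas.crosstab(pandas.cut(df.a, 4), pandas.cut(df.b, 4))
print(counts)

# Or aggregate a column per cell instead of counting
print(pandas.crosstab(pandas.cut(df.a, 4), pandas.cut(df.b, 4),
                      values=df.b, aggfunc="mean"))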

Edit: Since the OP was asking specifically for the means of b binned by the values in a, just do

groups.mean().b
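If you only need the b column, selecting it before aggregating avoids computing means for the columns you don't need:

groups["b"].mean()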

Also, if you want the index to look nicer (e.g., display intervals as the index), as in @bdiamante's example, use pandas.cut instead of numpy.digitize. (Kudos to bdiamante. I didn't realize pandas.cut existed.)

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100), 
                       "b": np.random.random(100) + 10})

# Bin the data frame by "a" (10 evenly spaced edges, i.e. 9 bins)...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(pandas.cut(df.a, bins))

# Get the mean of b, binned by the values in a
print(groups.mean().b)

This results in:

a
(0.00186, 0.111]    10.421839
(0.111, 0.22]       10.427540
(0.22, 0.33]        10.538932
(0.33, 0.439]       10.445085
(0.439, 0.548]      10.313612
(0.548, 0.658]      10.319387
(0.658, 0.767]      10.367444
(0.767, 0.876]      10.469655
(0.876, 0.986]      10.571008
Name: b

Not 100% sure if this is what you're looking for, but here's what I think you're getting at:

In [143]: import numpy as np; import pandas as pd

In [144]: df = pd.DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id": np.arange(100)})

In [145]: bins = [0, .25, .5, .75, 1]

In [146]: a_bins = df.a.groupby(pd.cut(df.a, bins))

In [147]: b_bins = df.b.groupby(pd.cut(df.b, bins))

In [148]: a_bins.agg(["mean", "median"])
Out[148]:
                 mean    median
a
(0, 0.25]    0.124173  0.114613
(0.25, 0.5]  0.367703  0.358866
(0.5, 0.75]  0.624251  0.626730
(0.75, 1]    0.875395  0.869843

In [149]: b_bins.agg(["mean", "median"])
Out[149]:
                 mean    median
b
(0, 0.25]    0.147936  0.166900
(0.25, 0.5]  0.394918  0.386729
(0.5, 0.75]  0.636111  0.655247
(0.75, 1]    0.851227  0.838805

Of course, I don't know what bins you had in mind, so you'll have to swap mine out for your own.
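For instance, if your bins have domain meaning, pandas.cut accepts a labels argument; here is a minimal sketch where the edges and label names are just placeholders:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.random(100)})

# Hypothetical named bins in place of interval labels
bins = [0, .25, .5, .75, 1]
labels = ["low", "mid-low", "mid-high", "high"]
print(df.a.groupby(pd.cut(df.a, bins, labels=labels)).mean())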


Joe Kington's answer was very helpful; however, I noticed that it does not bin all of the data. It actually leaves out the row with a = a.min(), because the intervals produced by pandas.cut are open on the left, so the minimum value falls outside the first bin. Summing up groups.size() gave 99 instead of 100.

To guarantee that all data is binned, just pass the number of bins to cut() as an integer, and the function will automatically extend the range by 0.1% on each side so that the minimum and maximum values are included.

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100) + 10})

# Bin the data frame by "a" with 10 bins...
groups = df.groupby(pandas.cut(df.a, 10))

# Get the mean of b, binned by the values in a
print(groups.mean().b)

In this case, summing up groups.size() gave 100.

I know this is a picky point for this particular problem, but for a similar problem I was trying to solve, it was crucial to obtain the correct answer.
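To see the difference concretely, here's a quick check, assuming the same df and imports as above:

# Explicit edges: pandas.cut intervals are open on the left,
# so the row holding a.min() gets no bin
edges = np.linspace(df.a.min(), df.a.max(), 10)
print(df.groupby(pandas.cut(df.a, edges)).size().sum())  # 99

# Integer bin count: cut pads the range, so every row is binned
print(df.groupby(pandas.cut(df.a, 10)).size().sum())     # 100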


If you do not have to stick to pandas grouping, you could use scipy.stats.binned_statistic:

import numpy as np
from scipy.stats import binned_statistic

# statistic="mean" is the default; the result also carries the bin
# edges and the bin index assigned to each point
means, bin_edges, binnumber = binned_statistic(
    df.a, df.b, statistic="mean",
    bins=np.linspace(df.a.min(), df.a.max(), 10))
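The returned statistic array has one entry per bin, so you can pair it with the edges to get output similar to the pandas versions. Note that, unlike pandas.cut with explicit edges, these bins include their left edge, so the row with a.min() is not dropped:

# One mean per bin; bin i covers [bin_edges[i], bin_edges[i+1])
# (the last bin also includes its right edge)
for left, right, m in zip(bin_edges[:-1], bin_edges[1:], means):
    print(f"[{left:.3f}, {right:.3f}): {m:.6f}")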