With `pandas.cut()`, how do I get integer bins and avoid getting a negative lowest bound?

Tags: python, pandas, intervals, bins

My dataframe has zero as its lowest value. I am trying to use the precision and include_lowest parameters of pandas.cut(), but I can't get the intervals to consist of integers rather than floats with one decimal. I also can't get the leftmost interval to start at zero.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style='white', font_scale=1.3)

df = pd.DataFrame(range(0,389,8)[:-1], columns=['value'])
df['binned_df_pd'] = pd.cut(df.value, bins=7, precision=0, include_lowest=True)
sns.pointplot(x='binned_df_pd', y='value', data=df)
plt.xticks(rotation=30, ha='right')


I have tried setting precision to -1, 0 and 1, but they all output floats with one decimal. The pandas.cut() help does mention that the x-min and x-max values are extended by 0.1% of the x-range, but I thought include_lowest might suppress this behaviour. My current workaround involves importing numpy:

import numpy as np

bin_counts, edges = np.histogram(df.value, bins=7)
edges = [int(x) for x in edges]
df['binned_df_np'] = pd.cut(df.value, bins=edges, include_lowest=True)

sns.pointplot(x='binned_df_np', y='value', data=df)
plt.xticks(rotation=30, ha='right')
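The 0.1% extension mentioned above can be inspected directly. This is a minimal sketch, assuming the same data as in the question (0, 8, ..., 376) and using the retbins parameter of pandas.cut() to get back the edges that were actually used; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Same values as in the question: 0, 8, ..., 376
values = pd.Series(range(0, 377, 8), name='value')

# retbins=True also returns the bin edges pandas computed
_, edges = pd.cut(values, bins=7, retbins=True)

# With the default right=True, only the lowest edge is pushed outward
value_range = values.max() - values.min()
expected_lowest = values.min() - 0.001 * value_range

print(edges[0], expected_lowest)   # lowest edge vs. min - 0.1% of the range
print(edges[-1], values.max())     # highest edge vs. max
```

The slightly negative lowest edge is what shows up as the negative bound of the first interval.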


Is there a way to obtain non-negative integers as the interval boundaries directly with pandas.cut() without using numpy?

Edit: I just noticed that specifying right=False makes the lowest interval start at 0 rather than -0.4. It seems to take precedence over include_lowest, as changing the latter has no visible effect in combination with right=False. The subsequent intervals are still shown with one decimal place.
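One way to build on the right=False observation is to pass explicit integer edges instead of a bin count. The sketch below is an assumption-laden illustration, not a confirmed answer from this thread: the edge values are hypothetical and chosen by hand, and the last edge has to lie above the maximum (376) because right=False leaves every interval open on the right.

```python
import pandas as pd

# Same values as in the question: 0, 8, ..., 376
values = pd.Series(range(0, 377, 8), name='value')

# Hypothetical hand-picked integer edges: 7 bins of width 54, ending above 376
edges = [0, 54, 108, 162, 216, 270, 324, 378]

# right=False gives left-closed intervals, so 0 belongs to the first bin
binned = pd.cut(values, bins=edges, right=False)

print(binned.cat.categories)       # the intervals used as categories
print(binned.isna().sum())         # number of values that fell outside all bins
```

Note that these bins are not the same as the ones produced by bins=7, since the edges are rounded to whole numbers.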



asked Sep 13 '15 16:09

joelostblom

2 Answers

None of the other answers (including OP's np.histogram workaround) seem to work anymore. They have upvotes, so I'm not sure if something has changed over the years.

IntervalIndex requires all intervals to be closed identically, so [0, 53] cannot coexist with (322, 376].
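The closure constraint can be checked on the categories that pd.cut returns. This is a small sketch assuming the question's data; it only reads the closed attribute of the resulting IntervalIndex and the left bound of the first interval.

```python
import pandas as pd

# Same values as in the question: 0, 8, ..., 376
values = pd.Series(range(0, 377, 8), name='value')

binned = pd.cut(values, bins=7, include_lowest=True)
cats = binned.cat.categories       # an IntervalIndex

print(cats.closed)                 # a single setting shared by every interval
print(cats[0].left)                # lower bound of the first interval
```

Because the first interval cannot be closed on the left while the others are not, the lowest value is included by moving the first left bound slightly below it.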

Here are two working solutions based on the relabeling approach:

Without numpy, reuse pd.cut edges as pd.cut labels

bins = 7

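# retbins=True returns the edges pandas used; the lowest one is slightly negative
_, edges = pd.cut(df.value, bins=bins, retbins=True)
# abs() keeps the lowest edge from printing as '-0'; :.0f rounds to the nearest integer
labels = [f'({abs(edges[i]):.0f}, {edges[i+1]:.0f}]' for i in range(bins)]

df['bin'] = pd.cut(df.value, bins=bins, labels=labels)

#     value         bin
# 1       8     (0, 54]
# 2      16     (0, 54]
# ..    ...         ...
# 45    360  (322, 376]
# 46    368  (322, 376]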

With numpy, convert np.linspace edges into pd.cut labels

bins = 7

edges = np.linspace(df.value.min(), df.value.max(), bins+1).astype(int)
labels = [f'({edges[i]}, {edges[i+1]}]' for i in range(bins)]

df['bin'] = pd.cut(df.value, bins=bins, labels=labels)

#     value         bin
# 1       8     (0, 53]
# 2      16     (0, 53]
# ..    ...         ...
# 45    360  (322, 376]
# 46    368  (322, 376]

Note: Only the labels are changed, so the underlying binning will still occur with 0.1% margins.
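To illustrate the note, the sketch below compares the default interval categories with the relabeled ones on the question's data. It assumes the np.linspace labeling from this answer; the names with_intervals and with_labels are only illustrative.

```python
import numpy as np
import pandas as pd

# Same values as in the question: 0, 8, ..., 376
values = pd.Series(range(0, 377, 8), name='value')
bins = 7

# Integer edges used only to build the label strings
edges = np.linspace(values.min(), values.max(), bins + 1).astype(int)
labels = [f'({edges[i]}, {edges[i+1]}]' for i in range(bins)]

with_intervals = pd.cut(values, bins=bins)
with_labels = pd.cut(values, bins=bins, labels=labels)

# Every value lands in the same bin position either way
same_codes = (with_intervals.cat.codes == with_labels.cat.codes).all()
print(same_codes)

# The relabeled categories are plain strings, not Interval objects
print(type(with_intervals.cat.categories).__name__)
print(type(with_labels.cat.categories).__name__)
```

Since the relabeled categories are strings, interval attributes such as left and right are not available on them.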

pointplot() output (as of pandas 1.2.4):

sns.pointplot(x='bin', y='value', data=df)
plt.xticks(rotation=30, ha='right')

answered Sep 18 '22 08:09

tdy

You should explicitly set the labels argument.

preparations:

lower, higher = df['value'].min(), df['value'].max()
n_bins = 7

build up the labels:

# n_bins + 1 = 8 evenly spaced integer edges from lower to higher (integer division)
edges = [lower + i * (higher - lower) // n_bins for i in range(n_bins + 1)]
lbs = ['(%d, %d]' % (edges[i], edges[i+1]) for i in range(len(edges)-1)]

set labels:

df['binned_df_pd'] = pd.cut(df.value, bins=n_bins, labels=lbs, include_lowest=True)

answered Sep 20 '22 08:09

PeterLai
