My dataframe has zero as the lowest value. I am trying to use the precision and include_lowest parameters of pandas.cut(), but I can't get the intervals to consist of integers rather than floats with one decimal. I also can't get the leftmost interval to stop at zero.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style='white', font_scale=1.3)
df = pd.DataFrame(range(0,389,8)[:-1], columns=['value'])
df['binned_df_pd'] = pd.cut(df.value, bins=7, precision=0, include_lowest=True)
sns.pointplot(x='binned_df_pd', y='value', data=df)
plt.xticks(rotation=30, ha='right')
I have tried setting precision to -1, 0 and 1, but they all output floats with one decimal. The pandas.cut() help does mention that the x-min and x-max values are extended by 0.1% of the x-range, but I thought maybe include_lowest could suppress this behaviour somehow. My current workaround involves importing numpy:
import numpy as np
bin_counts, edges = np.histogram(df.value, bins=7)
edges = [int(x) for x in edges]
df['binned_df_np'] = pd.cut(df.value, bins=edges, include_lowest=True)
sns.pointplot(x='binned_df_np', y='value', data=df)
plt.xticks(rotation=30, ha='right')
Is there a way to obtain non-negative integers as the interval boundaries directly with pandas.cut() without using numpy?
Edit: I just noticed that specifying right=False makes the lowest interval start at 0 rather than -0.4. It seems to take precedence over include_lowest, as changing the latter has no visible effect in combination with right=False. The remaining intervals are still specified with one decimal place.
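The effect of right=False on the outer edges can be checked by asking pd.cut() for the computed bin edges with retbins=True. This is a minimal sketch, assuming the same data as in the question (0 to 376 in steps of 8); the variable names are only illustrative. With the default right=True the 0.1% margin appears to be subtracted from the lowest edge, while with right=False it is added to the highest edge instead, which would explain why the lowest interval then starts at 0.

```python
import pandas as pd

# Same values as df.value in the question: 0, 8, ..., 376
values = pd.Series(range(0, 377, 8))

# retbins=True returns the bin edges that pd.cut() actually used
_, edges_right = pd.cut(values, bins=7, retbins=True)
_, edges_left = pd.cut(values, bins=7, right=False, retbins=True)

# Compare the outermost edges for both settings
print("right=True: ", edges_right[0], edges_right[-1])
print("right=False:", edges_left[0], edges_left[-1])
```

The inner edges are the same in both cases; only the side that receives the margin changes.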
None of the other answers (including OP's np.histogram workaround) seem to work anymore. They have upvotes, so I'm not sure if something has changed over the years.
IntervalIndex requires all intervals to be closed identically, so [0, 53] cannot coexist with (322, 376].
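The closure constraint can be inspected directly on the categories that pd.cut() returns. This is a small sketch, assuming integer edges like those from the np.histogram workaround; the hard-coded edges list is an illustration, not part of the original post. Since every interval shares one closed side, pandas cannot mark only the first one as closed on the left, and on recent versions the first interval is typically shown with a left edge slightly below zero instead.

```python
import pandas as pd

values = pd.Series(range(0, 377, 8))
# Illustrative integer edges (assumed, matching the 7 bins used above)
edges = [0, 53, 107, 161, 214, 268, 322, 376]

binned = pd.cut(values, bins=edges, include_lowest=True)
intervals = binned.cat.categories  # an IntervalIndex

# One closure setting applies to the whole index
print(intervals.closed)
# Inspect how the first interval is represented
print(intervals[0])
```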
Here are two working solutions based on the relabeling approach:
Without numpy, reuse pd.cut edges as pd.cut labels
bins = 7
_, edges = pd.cut(df.value, bins=bins, retbins=True)
labels = [f'({abs(edges[i]):.0f}, {edges[i+1]:.0f}]' for i in range(bins)]
df['bin'] = pd.cut(df.value, bins=bins, labels=labels)
# value bin
# 1 8 (0, 53]
# 2 16 (0, 53]
# .. ... ...
# 45 360 (322, 376]
# 46 368 (322, 376]
With numpy, convert np.linspace edges into pd.cut labels
bins = 7
edges = np.linspace(df.value.min(), df.value.max(), bins+1).astype(int)
labels = [f'({edges[i]}, {edges[i+1]}]' for i in range(bins)]
df['bin'] = pd.cut(df.value, bins=bins, labels=labels)
# value bin
# 1 8 (0, 53]
# 2 16 (0, 53]
# .. ... ...
# 45 360 (322, 376]
# 46 368 (322, 376]
Note: Only the labels are changed, so the underlying binning will still occur with 0.1% margins.
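If the binning itself should follow the integer edges rather than just the labels, one option is to pass the edges as bins together with the labels. The following is a sketch under stated assumptions, not part of the original answer: the names lo, hi and n_bins are illustrative, and the edges are truncated to integers, so they can differ slightly from the equal-width edges pandas would compute on its own.

```python
import pandas as pd

df = pd.DataFrame({'value': range(0, 377, 8)})  # same data as the question
n_bins = 7

lo, hi = int(df['value'].min()), int(df['value'].max())
# Integer edges from min to max using pure-Python integer arithmetic
edges = [lo + (hi - lo) * i // n_bins for i in range(n_bins + 1)]
labels = [f'({edges[i]}, {edges[i + 1]}]' for i in range(n_bins)]

# Explicit edges mean no 0.1% margin is applied;
# include_lowest=True keeps the minimum value in the first bin
df['bin'] = pd.cut(df['value'], bins=edges, labels=labels, include_lowest=True)
print(df['bin'].value_counts(sort=False))
```

Because include_lowest=True places the minimum in the first bin, a label such as '[0, 53]' may describe that bin more accurately than '(0, 53]'.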
pointplot() output (as of pandas 1.2.4):
sns.pointplot(x='bin', y='value', data=df)
plt.xticks(rotation=30, ha='right')
You should explicitly set the labels argument:
lower, higher = df['value'].min(), df['value'].max()
n_bins = 7
# range() rejects a float step and excludes the upper bound, so build the 8 integer edges explicitly
edges = [lower + (higher - lower) * i // n_bins for i in range(n_bins + 1)]
lbs = ['(%d, %d]'%(edges[i], edges[i+1]) for i in range(len(edges)-1)]
df['binned_df_pd'] = pd.cut(df.value, bins=n_bins, labels=lbs, include_lowest=True)