I have a dataset:
recency;frequency;monetary
21;156;41879955
13;88;16850284
8;74;79150488
2;74;26733719
9;55;16162365
...;...;...
detail raw data -> http://pastebin.com/beiEeS80
and I put it into a DataFrame. Here is my complete code:
df = pd.DataFrame(datas, columns=['userid', 'recency', 'frequency', 'monetary'])
df['recency'] = df['recency'].astype(float)
df['frequency'] = df['frequency'].astype(float)
df['monetary'] = df['monetary'].astype(float)
df['recency'] = pd.qcut(df['recency'].values, 5).codes + 1
df['frequency'] = pd.qcut(df['frequency'].values, 5).codes + 1
df['monetary'] = pd.qcut(df['monetary'].values, 5).codes + 1
but it returns an error:
df['frequency'] = pd.qcut(df['frequency'].values, 5).codes + 1
ValueError: Bin edges must be unique: array([ 1., 1., 2., 4., 9., 156.])
How can I solve this?
I ran this in Jupyter and placed exampledata.txt in the same directory as the notebook.
Please note that the first line:
df = pd.DataFrame(datas, columns=['userid', 'recency', 'frequency', 'monetary'])
loads the column 'userid', which isn't defined in the data file, so I removed this column name.
import pandas as pd

def pct_rank_qcut(series, n):
    edges = pd.Series([float(i) / n for i in range(n + 1)])
    f = lambda x: (edges >= x).argmax()
    return series.rank(pct=1).apply(f)
datas = pd.read_csv('./exampledata.txt', delimiter=';')
df = pd.DataFrame(datas, columns=['recency', 'frequency', 'monetary'])
df['recency'] = df['recency'].astype(float)
df['frequency'] = df['frequency'].astype(float)
df['monetary'] = df['monetary'].astype(float)
df['recency'] = pct_rank_qcut(df.recency, 5)
df['frequency'] = pct_rank_qcut(df.frequency, 5)
df['monetary'] = pct_rank_qcut(df.monetary, 5)
The problem you were seeing was a result of pd.qcut assuming 5 bins of equal size. In the data you provided, more than 28% of the 'frequency' values are 1's, so the lower quantile edges coincide, and that broke qcut. I provided a new function, pct_rank_qcut, that addresses this and pushes all the 1's into the first bin.
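As a hypothetical illustration (the small sample below is invented to mimic the skew in the real 'frequency' column), qcut fails as soon as two quantile edges coincide:

```python
import pandas as pd

# Hypothetical skewed sample: three of seven values are 1, so the 0th
# and 20th percentiles coincide and qcut's bin edges are not unique.
skewed = pd.Series([1, 1, 1, 2, 4, 9, 156])

try:
    pd.qcut(skewed, 5)
except ValueError as err:
    print('qcut failed:', err)
```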
edges = pd.Series([float(i) / n for i in range(n + 1)])
This line defines a series of percentile edges based on the desired number of bins n. For n = 5 the edges are [0.0, 0.2, 0.4, 0.6, 0.8, 1.0].
f = lambda x: (edges >= x).argmax()
This line defines a helper function to be applied to another series in the next line. edges >= x returns a series equal in length to edges, where each element is True or False depending on whether x is less than or equal to that edge. For x = 0.14, (edges >= x) is [False, True, True, True, True, True]. Taking the argmax() gives the first index where the series is True, in this case 1.
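Putting those two lines together on the concrete value 0.14 from the walkthrough (using `.values.argmax()` to sidestep the Series.argmax deprecation mentioned further down):

```python
import pandas as pd

n = 5
edges = pd.Series([float(i) / n for i in range(n + 1)])
print(list(edges))  # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

# First index where the edge is >= x.
f = lambda x: (edges >= x).values.argmax()
print(f(0.14))  # edges >= 0.14 -> [False, True, True, True, True, True], argmax = 1
```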
return series.rank(pct=1).apply(f)
This line takes the input series and turns it into a percentile ranking. I can compare these rankings to the edges I've created, and that's why I use apply(f). What's returned is a series of bin numbers from 1 to n. This series of bin numbers is the same thing you were trying to get with:
pd.qcut(df['recency'].values, 5).codes + 1
This has consequences: the bins are no longer of equal size, and bin 1 borrows completely from bin 2. But some choice had to be made. If you don't like this choice, use the concept to build your own ranking.
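A quick end-to-end sketch on hypothetical skewed data (values invented for illustration, with 30% tied at 1) shows all the tied 1's landing together in bin 1 while the rest spread over bins 2-5:

```python
import pandas as pd

def pct_rank_qcut(series, n):
    edges = pd.Series([float(i) / n for i in range(n + 1)])
    f = lambda x: (edges >= x).values.argmax()
    return series.rank(pct=1).apply(f)

# 30% of the values are 1: plain qcut would raise here, this does not.
freq = pd.Series([1, 1, 1, 2, 4, 9, 12, 156, 200, 300])
print(list(pct_rank_qcut(freq, 5)))  # [1, 1, 1, 2, 3, 3, 4, 4, 5, 5]
```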
print(df.head())
recency frequency monetary
0 3 5 5
1 2 5 5
2 2 5 5
3 1 5 5
4 2 5 5
pd.Series.argmax()
is now deprecated. Simply switch to pd.Series.values.argmax()
to update!
def pct_rank_qcut(series, n):
    edges = pd.Series([float(i) / n for i in range(n + 1)])
    f = lambda x: (edges >= x).values.argmax()
    return series.rank(pct=1).apply(f)
Various solutions are discussed here, but briefly:
if you are using pandas >= 0.20.0, there is an option duplicates='raise'|'drop' to control whether to raise on duplicated edges or to drop them; dropping results in fewer bins than specified, some larger (with more elements) than others.
For previous pandas versions try passing the ranked values instead of the values themselves:
pd.qcut(df['frequency'].rank(method='first').values, 5).codes + 1
In this way, identical values may end up in different quantiles. Whether that is acceptable depends on your specific needs. (If this is not what you want, have a look at pandas.cut, which chooses bins evenly spaced according to the values themselves, while pandas.qcut chooses the bins so that each bin holds the same number of records.)
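Both options can be sketched on the same hypothetical skewed sample used earlier (values invented for illustration; exact bin counts will depend on your data):

```python
import pandas as pd

freq = pd.Series([1, 1, 1, 2, 4, 9, 156])

# pandas >= 0.20.0: silently drop the duplicated edge; note this yields
# fewer than the 5 requested bins.
drop_codes = pd.qcut(freq.values, 5, duplicates='drop').codes + 1
print(list(drop_codes))

# Older pandas: rank with method='first' so every value is distinct
# before binning; identical values may then land in different bins.
rank_codes = pd.qcut(freq.rank(method='first').values, 5).codes + 1
print(list(rank_codes))
```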