I have a multi-index dataframe and I want to know the percentage of clients who paid a certain threshold of debt for each combination of the three criteria: City, Card and Collateral.
This is a working script:
import pandas as pd
d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Tokyo','Lisbon','Lisbon','Lisbon','Tokyo','Lisbon','Tokyo','Tokyo','Tokyo','Lisbon','Tokyo','Tokyo','Lisbon','Lisbon','Lisbon','Tokyo','Lisbon','Tokyo'],
'Card': ['Visa','Visa','Master Card','Master Card','Visa','Master Card','Visa','Visa','Master Card','Visa','Master Card','Visa','Visa','Master Card','Master Card','Visa','Master Card','Visa','Visa','Master Card','Visa','Master Card'],
'Colateral':['Yes','No','Yes','No','No','No','No','Yes','Yes','No','Yes','Yes','No','Yes','No','No','No','Yes','Yes','No','No','No'],
'Client Number':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
'% Debt Paid':[0.8,0.1,0.5,0.30,0,0.2,0.4,1,0.60,1,0.5,0.2,0,0.3,0,0,0.2,0,0.1,0.70,0.5,0.1]}
df = pd.DataFrame(data=d)
df1 = (df.set_index(['City','Card','Colateral'])
.drop(['Client Number'],axis=1)
.sum(level=[0,1,2]))
df2 = df1.reindex(pd.MultiIndex.from_product(df1.index.levels), fill_value=0)
And this is the result: the summed '% Debt Paid' for each (City, Card, Colateral) combination.
To overcome this issue I tried the following without success:
df1 = (df.set_index(['City','Card','Colateral'])
.drop(['Client Number'],axis=1)
[df.Total = 0].count(level=[0,1,2])/[df.Total].count()
[df.Total > 0 & df.Total <=0.25 ].count(level=[0,1,2])/[df.Total].count()
[df.Total > 0.25 & df.Total <=0.5 ].count(level=[0,1,2])/[df.Total])
[df.Total > 0.5 & df.Total <=0.75 ].count(level=[0,1,2])/[df.Total]
[df.Total > 0.75 & df.Total <1 ].count(level=[0,1,2])/[df.Total]
[df.Total = 1].count(level=[0,1,2])/[df.Total]
[df.Total > 1].count(level=[0,1,2])/[df.Total])
df2 = df1.reindex(pd.MultiIndex.from_product(df1.index.levels), fill_value=0)
And this is the result I wish to accomplish for all the criteria. Any thoughts on how to solve this? Thank you.
You can do it with value_counts:
newdf = df.groupby(['City', 'Card', 'Colateral'])['% Debt Paid'] \
          .value_counts(bins=[-0.1, 0, 0.25, 0.5, 0.75, 1, 1.0001, 999],
                        normalize=True)
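The result is a fraction per bin within each group. A minimal post-processing sketch, assuming newdf is the Series produced above (the scaling to percentages, the rounding and the name newdf_pct are my additions, not part of this answer):
# Sketch (my addition): show the normalized counts as percentages.
newdf_pct = (newdf * 100).round(2)
print(newdf_pct)
Note that with these edges the exact 0 payments get their own (-0.1, 0] bin, while a payment of exactly 1.0 falls into the (0.75, 1.0] bin rather than a separate 100% bin.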
Here's the quick-n-dirty answer:
import numpy as np
import pandas as pd

group_cols = ['City', 'Card', 'Colateral']
debt_col = '% Debt Paid'

# (1) Bin the data that is in non-zero-width intervals
bins = pd.IntervalIndex.from_breaks((0, 0.25, 0.5, 0.75, 1, np.inf),
                                    closed='right')
ser_pt1 = df.groupby(group_cols, sort=False)[debt_col]\
            .value_counts(bins=bins, sort=False, normalize=True)

# (2) Get the data from the zero-width intervals (0% and 100%)
ser_pt2 = df[df[debt_col].isin((0, 1))]\
            .groupby(group_cols)[debt_col].value_counts()

# Take also the "zero counts" and normalize by the group sizes
ser_pt2 = ser_pt2.reindex(
    pd.MultiIndex.from_product(ser_pt2.index.levels,
                               names=ser_pt2.index.names),
    fill_value=0) / df.groupby(group_cols)[debt_col].count()

# (3) Combine the results
ser_out = pd.concat([ser_pt1, ser_pt2])
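To eyeball the combined result from this snippet, a minimal inspection sketch (the scaling to percentages and the name ser_pct are my additions; ser_out comes from the code above):
# Sketch (my addition): view the combined result as rounded percentages.
ser_pct = (ser_out * 100).round(2)
print(ser_pct)
At this point the last index level still mixes raw Interval objects with the floats 0.0 and 1.0; the full answer below renames and orders these bins.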
Below is a copy-pasteable full answer which also sets the index names and ordering as requested in the question.
The problem becomes more difficult to solve because the bins you want intersect: you want a bin for ]75, 100] and a bin for [100, 100], and both should include the case where % Debt Paid is exactly 1.0. I would handle the two cases separately:
(1) Binning the values into ]0, 25]%, ]25, 50]%, ..., ]100%, np.inf]%
(2) Counting the exact values 0% and 100%
Details for (1):
- The counting is done with gp[debt_col].value_counts, which is essentially pd.Series.value_counts, since gp is a DataFrameGroupBy object and gp[debt_col] is a SeriesGroupBy object.
- >100% is also a bin, with its right endpoint at infinity (np.inf).
Details for (2):
- df[debt_col].isin((0, 1)) selects only the 0.0 and 1.0 cases, and value_counts then counts the occurrences of "0%" and "100%".
- The reindex brings in the "zero counts" as well, and the result is normalized against the total count in each group (df.groupby(group_cols)[debt_col].count()).
Other notes:
- The bins are renamed and ordered via a dictionary and the categories argument of pd.Categorical; we rely on the fact that in Python 3.6+ dictionaries preserve ordering.
- ser_out.sort_index() did not work out even with a categorical index; I am thinking it might be a bug in pandas. Therefore, the result Series ser_out is cast to a DataFrame df_out, and the sorting is done using dataframe.set_index.
- Zero-width bins cause value_counts to yield really bizarre results, maybe a bug in pandas. That is why the problem is divided into the two steps above: (1) count the data in the non-zero-width bins, (2) count the data in the zero-width bins ("0%" and "100%").
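As a small, hedged illustration of those two pieces (it reuses df from the question; the mask variable and the printing are mine, for demonstration only):
import numpy as np
import pandas as pd

# Step (1) bins: five right-closed intervals, the last one ending at infinity.
bins = pd.IntervalIndex.from_breaks((0, 0.25, 0.5, 0.75, 1, np.inf),
                                    closed='right')
print(bins)

# Step (2) in isolation: keep only the exact 0.0 and 1.0 payments and
# count them per (City, Card, Colateral) group.
mask = df['% Debt Paid'].isin((0, 1))
print(df[mask]
      .groupby(['City', 'Card', 'Colateral'])['% Debt Paid']
      .value_counts())
The full, copy-pasteable answer follows.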
import pandas as pd
import numpy as np
d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Tokyo','Lisbon','Lisbon','Lisbon','Tokyo','Lisbon','Tokyo','Tokyo','Tokyo','Lisbon','Tokyo','Tokyo','Lisbon','Lisbon','Lisbon','Tokyo','Lisbon','Tokyo'],
'Card': ['Visa','Visa','Master Card','Master Card','Visa','Master Card','Visa','Visa','Master Card','Visa','Master Card','Visa','Visa','Master Card','Master Card','Visa','Master Card','Visa','Visa','Master Card','Visa','Master Card'],
'Colateral':['Yes','No','Yes','No','No','No','No','Yes','Yes','No','Yes','Yes','No','Yes','No','No','No','Yes','Yes','No','No','No'],
'Client Number':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
'% Debt Paid':[0.8,0.1,0.5,0.30,0,0.2,0.4,1,0.60,1,0.5,0.2,0,0.3,0,0,0.2,0,0.1,0.70,0.5,0.1]}
df = pd.DataFrame(data=d)
def _get_binned_part(df, group_cols, debt_col):
    bins = pd.IntervalIndex.from_breaks((0, 0.25, 0.5, 0.75, 1, np.inf),
                                        closed='right')
    gp = df[group_cols + [debt_col]].groupby(group_cols, sort=False)
    ser_pt1 = gp[debt_col].value_counts(bins=bins, sort=False, normalize=True)
    ser_pt1.index.set_names('bin', level=3, inplace=True)
    return ser_pt1

def _get_non_binned_part(df, group_cols, debt_col):
    # Count 0% and 100% occurrences
    ser_pt2 = df[df[debt_col].isin((0, 1))]\
        .groupby(group_cols)[debt_col].value_counts()
    # include zero counts
    ser_pt2 = ser_pt2.reindex(pd.MultiIndex.from_product(
        ser_pt2.index.levels, names=ser_pt2.index.names),
        fill_value=0)
    ser_pt2.index.set_names('bin', level=3, inplace=True)
    # ser_counts has the counts for normalization.
    ser_counts = df.groupby(group_cols)[debt_col].count()
    ser_pt2 = ser_pt2 / ser_counts
    return ser_pt2

def _rename_bins(ser_out, group_cols, debt_col):
    bin_names = []
    bin_name_dict = {
        '0.0': '0%',
        '(0.0, 0.25]': ']0, 25]%',
        '(0.25, 0.5]': ']25, 50]%',
        '(0.5, 0.75]': ']50, 75]%',
        '(0.75, 1.0]': ']75, 100]%',
        '1.0': '100%',
        '(1.0, inf]': '>100%',
    }
    bin_order = list(bin_name_dict.values())
    for val in ser_out.index.levels[3].values:
        bin_names.append(bin_name_dict.get(val.__str__(), val.__str__()))
    bin_categories = pd.Categorical(bin_names,
                                    categories=bin_order,
                                    ordered=True)
    ser_out.index.set_levels(bin_categories, level=3, inplace=True)
    # For some reason, .sort_index() does not sort correctly
    # -> Make it a dataframe and sort there.
    df_out = ser_out.reset_index()
    df_out['bin'] = pd.Categorical(df_out['bin'].values,
                                   bin_order,
                                   ordered=True)
    df_out = df_out.sort_values(group_cols + ['bin']).set_index(group_cols +
                                                                ['bin'])
    df_out.rename(columns={debt_col: 'in_bin'}, inplace=True)
    df_out['in_bin'] = (df_out['in_bin'] * 100).round(2)
    return df_out

def get_results(df):
    group_cols = ['City', 'Card', 'Colateral']
    debt_col = '% Debt Paid'
    ser_pt1 = _get_binned_part(df, group_cols, debt_col)
    ser_pt2 = _get_non_binned_part(df, group_cols, debt_col)
    ser_out = pd.concat([ser_pt1, ser_pt2])
    df_out = _rename_bins(ser_out, group_cols, debt_col)
    return df_out
df_out = get_results(df)
In [1]: df_out
Out[1]:
in_bin
City Card Colateral bin
Lisbon Master Card No 0% 0.00
]0, 25]% 100.00
]25, 50]% 0.00
]50, 75]% 0.00
]75, 100]% 0.00
100% 0.00
>100% 0.00
Yes 0% 0.00
]0, 25]% 0.00
]25, 50]% 100.00
]50, 75]% 0.00
]75, 100]% 0.00
100% 0.00
>100% 0.00
Visa No 0% 0.00
]0, 25]% 0.00
]25, 50]% 66.67
]50, 75]% 0.00
]75, 100]% 33.33
100% 33.33
>100% 0.00
Yes 0% 33.33
]0, 25]% 33.33
]25, 50]% 0.00
]50, 75]% 0.00
]75, 100]% 33.33
100% 33.33
>100% 0.00
Tokyo Master Card No 0% 25.00
]0, 25]% 25.00
]25, 50]% 25.00
]50, 75]% 25.00
]75, 100]% 0.00
100% 0.00
>100% 0.00
Yes 0% 0.00
]0, 25]% 0.00
]25, 50]% 50.00
]50, 75]% 50.00
]75, 100]% 0.00
100% 0.00
>100% 0.00
Visa No 0% 75.00
]0, 25]% 25.00
]25, 50]% 0.00
]50, 75]% 0.00
]75, 100]% 0.00
100% 0.00
>100% 0.00
Yes 0% 0.00
]0, 25]% 50.00
]25, 50]% 0.00
]50, 75]% 0.00
]75, 100]% 50.00
100% 0.00
>100% 0.00
With this combination
In [1]: df.loc[ (df['City'] == 'Lisbon') & (df['Card'] == 'Visa') & (df['Colateral'] == 'No')]
Out[1]:
City Card Colateral Client Number % Debt Paid
6 Lisbon Visa No 7 0.4
9 Lisbon Visa No 10 1.0
20 Lisbon Visa No 21 0.5
the output data table should have
0% 0%
]0, 25]% 0%
]25, 50]% 66.7%
]50, 75]% 0%
]75, 100]% 33.3%
100% 33.3%
>100% 0%
Note that the one intersecting bin pair (]75, 100] and [100, 100]) will sometimes cause the total sum of the output column to be greater than 100%.
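A quick hedged check of that note, using the df_out computed above for the Lisbon / Visa / No group shown in the table (the selection and the sum are my additions):
# A payment of exactly 1.0 is counted in both ']75, 100]%' and '100%',
# so this group's column sums to roughly 133.33 rather than 100.
print(df_out.loc[('Lisbon', 'Visa', 'No'), 'in_bin'].sum())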