
Count if in multiple index Dataframe

I have a multi-index dataframe and I want to know the percentage of clients who paid a certain threshold of debt for each of the 3 criteria: City, Card and Collateral.

This is a working script:

import pandas as pd

d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Tokyo','Lisbon','Lisbon','Lisbon','Tokyo','Lisbon','Tokyo','Tokyo','Tokyo','Lisbon','Tokyo','Tokyo','Lisbon','Lisbon','Lisbon','Tokyo','Lisbon','Tokyo'], 
     'Card': ['Visa','Visa','Master Card','Master Card','Visa','Master Card','Visa','Visa','Master Card','Visa','Master Card','Visa','Visa','Master Card','Master Card','Visa','Master Card','Visa','Visa','Master Card','Visa','Master Card'],
     'Colateral':['Yes','No','Yes','No','No','No','No','Yes','Yes','No','Yes','Yes','No','Yes','No','No','No','Yes','Yes','No','No','No'],
     'Client Number':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
     '% Debt Paid':[0.8,0.1,0.5,0.30,0,0.2,0.4,1,0.60,1,0.5,0.2,0,0.3,0,0,0.2,0,0.1,0.70,0.5,0.1]}

df = pd.DataFrame(data=d)

df1 = (df.set_index(['City','Card','Colateral'])
         .drop(['Client Number'],axis=1)
        .sum(level=[0,1,2]))

df2 = df1.reindex(pd.MultiIndex.from_product(df1.index.levels), fill_value=0)

And this is the result:

[image: the resulting table of summed % Debt Paid per (City, Card, Colateral)]

To overcome this issue I tried the following without success:


df1 = (df.set_index(['City','Card','Colateral'])
        .drop(['Client Number'],axis=1)
        [df.Total = 0].count(level=[0,1,2])/[df.Total].count()
        [df.Total > 0 & df.Total <= 0.25].count(level=[0,1,2])/[df.Total].count()
        [df.Total > 0.25 & df.Total <= 0.5].count(level=[0,1,2])/[df.Total]
        [df.Total > 0.5 & df.Total <= 0.75].count(level=[0,1,2])/[df.Total]
        [df.Total > 0.75 & df.Total < 1].count(level=[0,1,2])/[df.Total]
        [df.Total = 1].count(level=[0,1,2])/[df.Total]
        [df.Total > 1].count(level=[0,1,2])/[df.Total])

df2 = df1.reindex(pd.MultiIndex.from_product(df1.index.levels), fill_value=0)

And this is the result I wish to accomplish for all the criteria. Any thoughts on how to solve this? Thank you.

[image: desired output — share of clients in each debt-paid bracket, per (City, Card, Colateral)]

asked Jul 02 '20 by Tiago Emanuel Pratas

2 Answers

You can use value_counts with bins:

newdf=df.groupby(['City','Card','Colateral'])['% Debt Paid'].\
           value_counts(bins=[-0.1,0,0.25,0.5,0.75,1,1.0001,999],normalize=True)
answered Sep 29 '22 by BENY


TL;DR

group_cols = ['City', 'Card', 'Colateral']
debt_col = '% Debt Paid'

# (1) Bin the data that is in non-zero-width intervals
bins = pd.IntervalIndex.from_breaks((0, 0.25, 0.5, 0.75, 1, np.inf),
                                    closed='right')
ser_pt1 = df.groupby(group_cols, sort=False)[debt_col]\
    .value_counts(bins=bins, sort=False, normalize=True)

# (2) Get the data from zero width intervals (0% and 100%)
ser_pt2 = df[df[debt_col].isin((0, 1))]\
        .groupby(group_cols)[debt_col].value_counts()

# Take also "zero counts" and normalize
ser_pt2 = ser_pt2.reindex(
    pd.MultiIndex.from_product(ser_pt2.index.levels,
                               names=ser_pt2.index.names),
    fill_value=0) / df.groupby(group_cols)[debt_col].count()

# (3) Combine the results
ser_out = pd.concat([ser_pt1, ser_pt2])

Here's the quick-n-dirty answer. Below is a copy-pasteable full answer which also sets the index names and ordering as requested in the question.

1. Summary

The problem becomes more difficult to solve because the bins you want intersect. That is, you want a bin for ]75, 100] and one for [100, 100], both of which should include the case where % Debt Paid is exactly 1.0. I would handle the two cases separately:

    (1)   Binning for values ]0, 25]%, ]25, 50]%, ..., ]100, np.inf]%
    (2)   The exact values 0% and 100%

2. Description of solution

2.1 Binned part

  • The binned part is calculated using gp[debt_col].value_counts, which is essentially using pd.Series.value_counts since gp is a DataFrameGroupBy object and gp[debt_col] is a SeriesGroupBy object.
  • The bins needed for the value_counts can be created easily from a list of endpoints using pd.IntervalIndex.from_breaks
  • The >100% is also a bin, with right endpoint at infinity (np.inf).
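As a small self-contained illustration of the bins used above (the breakpoints are taken from the answer's code):

```python
import numpy as np
import pandas as pd

# Right-closed intervals: (0, 0.25], (0.25, 0.5], (0.5, 0.75],
# (0.75, 1.0], (1.0, inf]
bins = pd.IntervalIndex.from_breaks((0, 0.25, 0.5, 0.75, 1, np.inf),
                                    closed='right')
print(list(bins))
```

Note that 1.0 falls in (0.75, 1.0] here, which is why the exact-100% case has to be counted separately in step (2).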

2.2 The rest (0% and 100%)

  • Use pd.Series.isin, as in df[debt_col].isin((0, 1)), to select only the 0.0 and 1.0 cases, and then use value_counts to count the occurrences of "0%" and "100%".
  • Then we also need to include the cases where the count is zero. This is done by reindexing: pd.Series.reindex gives a row for each ("City", "Card", "Colateral") combination, with the combinations formed by pd.MultiIndex.from_product.
  • Lastly, we normalize the counts by dividing by the total count in each group (df.groupby(group_cols)[debt_col].count()).
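The reindexing trick can be sketched on a toy two-level Series (the names here are illustrative, not the answer's full four-level index):

```python
import pandas as pd

# Toy counts Series where two of the four (City, bin) combinations
# are missing entirely
ser = pd.Series(
    [2, 1],
    index=pd.MultiIndex.from_tuples([('Lisbon', '0%'), ('Tokyo', '100%')],
                                    names=['City', 'bin']),
)

# Reindex over the full product of the index levels, filling the
# absent combinations with a count of 0
full = ser.reindex(
    pd.MultiIndex.from_product(ser.index.levels, names=ser.index.names),
    fill_value=0)
print(full)
```

After this step every group has a row for every bin, so dividing by the per-group totals yields a 0% share instead of a missing row.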

2.3 Renaming

  • Our new index level (level 3, called 'bin') is now ready, but to get the same output as in the OP's question, we need to rename the index labels. This is done by looping over the values and using a "lookup dictionary" for the new names.
  • The ordering of the labels in the index is by default numerical/alphabetical, which is not what we want here. To force the index order after sorting, we must use pd.Categorical as the index; the sort order is given in the categories argument. We rely on the fact that in Python 3.6+ dictionaries preserve insertion order.
  • For some reason ser_out.sort_index() did not sort correctly even with a categorical index; this might be a bug in pandas. Therefore, the result Series ser_out is cast to a DataFrame df_out, and the sorting is done on the dataframe.
  • Lastly, the resulting dataframe is given a MultiIndex with set_index.
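The ordered-categorical sorting idea can be shown in isolation. On a flat index (a simpler case than the answer's MultiIndex level), sort_index does respect the category order:

```python
import pandas as pd

# The custom bin order we want; plain alphabetical sorting would put
# '100%' before ']0, 25]%' and so on.
bin_order = ['0%', ']0, 25]%', ']25, 50]%', ']50, 75]%',
             ']75, 100]%', '100%', '>100%']

s = pd.Series([1, 2, 3], index=['100%', '0%', ']25, 50]%'])
s.index = pd.CategoricalIndex(s.index, categories=bin_order, ordered=True)

# Rows come out in bin_order, not alphabetically
print(s.sort_index())
```

The categories list fixes the sort key, which is exactly what the lookup dictionary's insertion order provides in the full solution.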

Code

Zero-width bins cause value_counts to yield really bizarre results; this may be a bug in pandas. Therefore, let's divide the problem into two steps: (1) count the data in the non-zero-width bins, and (2) count the data in the zero-width bins ("0%" and "100%").

import pandas as pd
import numpy as np

d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Tokyo','Lisbon','Lisbon','Lisbon','Tokyo','Lisbon','Tokyo','Tokyo','Tokyo','Lisbon','Tokyo','Tokyo','Lisbon','Lisbon','Lisbon','Tokyo','Lisbon','Tokyo'], 
     'Card': ['Visa','Visa','Master Card','Master Card','Visa','Master Card','Visa','Visa','Master Card','Visa','Master Card','Visa','Visa','Master Card','Master Card','Visa','Master Card','Visa','Visa','Master Card','Visa','Master Card'],
     'Colateral':['Yes','No','Yes','No','No','No','No','Yes','Yes','No','Yes','Yes','No','Yes','No','No','No','Yes','Yes','No','No','No'],
     'Client Number':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
     '% Debt Paid':[0.8,0.1,0.5,0.30,0,0.2,0.4,1,0.60,1,0.5,0.2,0,0.3,0,0,0.2,0,0.1,0.70,0.5,0.1]}


df = pd.DataFrame(data=d)


def _get_binned_part(df, group_cols, debt_col):
    bins = pd.IntervalIndex.from_breaks((0, 0.25, 0.5, 0.75, 1, np.inf),
                                        closed='right')
    gp = df[group_cols + [debt_col]].groupby(group_cols, sort=False)
    ser_pt1 = gp[debt_col].value_counts(bins=bins, sort=False, normalize=True)
    ser_pt1.index.set_names('bin', level=3, inplace=True)
    return ser_pt1


def _get_non_binned_part(df, group_cols, debt_col):
    # Count 0% and 100% occurences
    ser_pt2 = df[df[debt_col].isin((0, 1))]\
            .groupby(group_cols)[debt_col].value_counts()
    # include zero counts
    ser_pt2 = ser_pt2.reindex(pd.MultiIndex.from_product(
        ser_pt2.index.levels, names=ser_pt2.index.names),
                              fill_value=0)
    ser_pt2.index.set_names('bin', level=3, inplace=True)

    # ser_counts has the counts for normalization.
    ser_counts = df.groupby(group_cols)[debt_col].count()
    ser_pt2 = ser_pt2 / ser_counts

    return ser_pt2


def _rename_bins(ser_out, group_cols, debt_col):
    bin_names = []
    bin_name_dict = {
        '0.0': '0%',
        '(0.0, 0.25]': ']0, 25]%',
        '(0.25, 0.5]': ']25, 50]%',
        '(0.5, 0.75]': ']50, 75]%',
        '(0.75, 1.0]': ']75, 100]%',
        '1.0': '100%',
        '(1.0, inf]': '>100%',
    }
    bin_order = list(bin_name_dict.values())
    for val in ser_out.index.levels[3].values:
        bin_names.append(bin_name_dict.get(val.__str__(), val.__str__()))

    bin_categories = pd.Categorical(bin_names,
                                    categories=bin_order,
                                    ordered=True)
    ser_out.index.set_levels(bin_categories, level=3, inplace=True)

    # For some reason, .sort_index() does not sort correctly
    # -> Make it a dataframe and sort there.
    df_out = ser_out.reset_index()
    df_out['bin'] = pd.Categorical(df_out['bin'].values,
                                   bin_order,
                                   ordered=True)
    df_out = df_out.sort_values(group_cols + ['bin']).set_index(group_cols +
                                                                ['bin'])

    df_out.rename(columns={debt_col: 'in_bin'}, inplace=True)
    df_out['in_bin'] = (df_out['in_bin'] * 100).round(2)

    return df_out


def get_results(df):
    group_cols = ['City', 'Card', 'Colateral']
    debt_col = '% Debt Paid'

    ser_pt1 = _get_binned_part(df, group_cols, debt_col)
    ser_pt2 = _get_non_binned_part(df, group_cols, debt_col)
    ser_out = pd.concat([ser_pt1, ser_pt2])
    df_out = _rename_bins(ser_out, group_cols, debt_col)

    return df_out

df_out = get_results(df)

Example output

In [1]: df_out
Out[1]:
                                         in_bin
City   Card        Colateral bin
Lisbon Master Card No        0%            0.00
                             ]0, 25]%    100.00
                             ]25, 50]%     0.00
                             ]50, 75]%     0.00
                             ]75, 100]%    0.00
                             100%          0.00
                             >100%         0.00
                   Yes       0%            0.00
                             ]0, 25]%      0.00
                             ]25, 50]%   100.00
                             ]50, 75]%     0.00
                             ]75, 100]%    0.00
                             100%          0.00
                             >100%         0.00
       Visa        No        0%            0.00
                             ]0, 25]%      0.00
                             ]25, 50]%    66.67
                             ]50, 75]%     0.00
                             ]75, 100]%   33.33
                             100%         33.33
                             >100%         0.00
                   Yes       0%           33.33
                             ]0, 25]%     33.33
                             ]25, 50]%     0.00
                             ]50, 75]%     0.00
                             ]75, 100]%   33.33
                             100%         33.33
                             >100%         0.00
Tokyo  Master Card No        0%           25.00
                             ]0, 25]%     25.00
                             ]25, 50]%    25.00
                             ]50, 75]%    25.00
                             ]75, 100]%    0.00
                             100%          0.00
                             >100%         0.00
                   Yes       0%            0.00
                             ]0, 25]%      0.00
                             ]25, 50]%    50.00
                             ]50, 75]%    50.00
                             ]75, 100]%    0.00
                             100%          0.00
                             >100%         0.00
       Visa        No        0%           75.00
                             ]0, 25]%     25.00
                             ]25, 50]%     0.00
                             ]50, 75]%     0.00
                             ]75, 100]%    0.00
                             100%          0.00
                             >100%         0.00
                   Yes       0%            0.00
                             ]0, 25]%     50.00
                             ]25, 50]%     0.00
                             ]50, 75]%     0.00
                             ]75, 100]%   50.00
                             100%          0.00
                             >100%         0.00

Appendix

Desired example output: "Lisbon, Visa, No"

With this combination

In [1]: df.loc[ (df['City'] == 'Lisbon') & (df['Card'] == 'Visa') & (df['Colateral'] == 'No')]
Out[1]:
      City  Card Colateral  Client Number  % Debt Paid
6   Lisbon  Visa        No              7          0.4
9   Lisbon  Visa        No             10          1.0
20  Lisbon  Visa        No             21          0.5

the output data table should have

0%            0%
]0, 25]%      0%
]25, 50]%     66.7%
]50, 75]%     0%
]75, 100]%    33.3%
100%          33.3%
>100%         0%

Note that the one intersecting bin pair (]75, 100] and [100, 100]) means the output column can sometimes sum to more than 100%.
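As a hedged sanity check (not part of either answer's code), these shares can be reproduced by hand from the three values 0.4, 1.0 and 0.5: the ]25, 50]% bracket gets 2/3 ≈ 66.7%, and both ]75, 100]% and the zero-width 100% bracket get 1/3 ≈ 33.3%, which is how the column ends up summing past 100%:

```python
import numpy as np
import pandas as pd

vals = pd.Series([0.4, 1.0, 0.5])  # the Lisbon/Visa/No values

# Non-zero-width brackets, counted with pd.cut + value_counts
bins = pd.IntervalIndex.from_breaks((0, 0.25, 0.5, 0.75, 1, np.inf),
                                    closed='right')
shares = pd.cut(vals, bins).value_counts(sort=False, normalize=True) * 100

# The zero-width 100% bracket, counted separately
share_exact_100 = (vals == 1.0).mean() * 100

print(shares.round(1))
print(round(share_exact_100, 1))
```

The brackets from pd.cut alone sum to 100%; the separately counted 100% row is the extra overlap.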

answered Sep 29 '22 by np8