Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Multiple aggregated Counting in Pandas

I have a DF:

data = [["John","144","Smith","200"], ["Mia","220","John","144"],["Caleb","155","Smith","200"],["Smith","200","Jason","500"]]
data_frame = pd.DataFrame(data,columns = ["Name","ID","Manager_name","Manager_ID"])
data_frame

OP:

    Name    ID    Manager_name  Manager_ID
  0 John    144    Smith             200
  1 Mia     220    John              144
  2 Caleb   155    Smith             200
  3 Smith   200    Jason             500

I am trying to count the number of people reporting under each person in the column Name.

Logic is:

Count the number of people reporting individually and people reporting under in the chain. For example with Smith; John and Caleb reports to Smith so 2 + 1 with Mia reporting to John (who already reports to Smith) so total 3.

Similarly for Jason -> 1 because Smith reports to him and 3 people already report to Smith so total 4.

I understand how to do it pythonically with some recursion, is there a way to efficiently do it in Pandas. Any suggestions?

Expected OP:

Name        Number of people reporting
John               1
Mia                0
Caleb              0
Smith              3
Jason              4
like image 712
Vijay Dahiya Avatar asked Apr 30 '19 04:04

Vijay Dahiya


2 Answers

Scott Boston's Networkx solution is the preferred solution...

There are two solutions to this problem. The first one is a vectorized pandas type solution and should be fast over larger datasets, the second is pythonic and does not work well on the size of dataset the OP was looking for, the original df size is (223635,4).

  1. PANDAS SOLUTION

This problem seeks to find out how many people each person in an organization manages, including subordinate's subordinates. This solution will create a dataframe by adding successive columns that are the managers of the previous columns, and then counting the occurance of each employee in that dataframe to determine the total number under them.

First we set up the input.

import pandas as pd
import numpy as np

data = [
    ["John", "144", "Smith", "200"],
    ["Mia", "220", "John", "144"],
    ["Caleb", "155", "Smith", "200"],
    ["Smith", "200", "Jason", "500"],
]
df = pd.DataFrame(data, columns=["Name", "SID", "Manager_name", "Manager_SID"])

df = df[["SID", "Manager_SID"]]

# shortening the columns for convenience
df.columns = ["1", "2"]

print(df)

     1    2
0  144  200
1  220  144
2  155  200
3  200  500

First the employees without subordinates must be counted and put into a seperate dictionary.

df_not_mngr = df.loc[~df['1'].isin(df['2']), '1']
non_mngr_dict = {str(key):0 for key in df_not_mngr.values}
non_mngr_dict

{'220': 0, '155': 0}

Next we will modify the dataframe by adding columns of managers of the previous column. The loop is stopped when there are no employees in the right most column

for i in range(2, 10):
    df = df.merge(
        df[["1", "2"]], how="left", left_on=str(i), right_on="1", suffixes=("_l", "_r")
    ).drop("1_r", axis=1)
    df.columns = [str(x) for x in range(1, i + 2)]
    if df.iloc[:, -1].isnull().all():
        break
    else:
        continue
print(df)

    1    2    3    4    5
0  144  200  500  NaN  NaN
1  220  144  200  500  NaN
2  155  200  500  NaN  NaN
3  200  500  NaN  NaN  NaN

All columns except the first columns are collapsed and each employee counted and added to a dictionary.

from collections import Counter

result = dict(Counter(df.iloc[:, 1:].values.flatten()))

The non manager dictionary is added to the result.

result.update(non_mngr_dict)
result

{'200': 3, '500': 4, nan: 8, '144': 1, '220': 0, '155': 0}
  1. RECURSIVE PYTHONIC SOLUTION

I think this is probably way more pythonic than you were looking for. First I created a list 'all_sids' to make sure we capture all employees as not all are in each list.

import pandas as pd
import numpy as np

data = [
    ["John", "144", "Smith", "200"],
    ["Mia", "220", "John", "144"],
    ["Caleb", "155", "Smith", "200"],
    ["Smith", "200", "Jason", "500"],
]
df = pd.DataFrame(data, columns=["Name", "SID", "Manager_name", "Manager_SID"])

all_sids = pd.unique(df[['SID', 'Manager_SID']].values.ravel('K'))

Then create a pivot table.

dfp = df.pivot_table(values='Name', index='SID', columns='Manager_SID', aggfunc='count')

dfp

Manager_SID  144  200  500
SID                       
144          NaN  1.0  NaN
155          NaN  1.0  NaN
200          NaN  NaN  1.0
220          1.0  NaN  NaN

Then a function that will go through the pivot table to total up all the reports.

def count_mngrs(SID, count=0):
    if str(SID) not in dfp.columns:
        return count
    else:
        count += dfp[str(SID)].sum()
        sid_list = dfp[dfp[str(SID)].notnull()].index
        for sid in sid_list:
            count = count_mngrs(sid, count)
        return count

Call the function for each employee and print the results.

print('SID', '     Number of People Reporting')
for sid in all_sids: 
    print(sid, "     " , int(count_mngrs(sid)))

Results are below, sorry I was a bit lazy in putting the names with the sids.

SID      Number of People Reporting
144       1
220       0
155       0
200       3
500       4

Look forward to seeing a more pandas type solution!

like image 75
run-out Avatar answered Oct 08 '22 13:10

run-out


This is also, a graph problem and you can use Networkx:

import networkx as nx
import pandas as pd
data = [["John","144","Smith","200"], ["Mia","220","John","144"],["Caleb","155","Smith","200"],["Smith","200","Jason","500"]]
data_frame = pd.DataFrame(data,columns = ["Name","ID","Manager_name","Manager_ID"])

#create a directed graph object using nx.DiGraph
G = nx.from_pandas_edgelist(data_frame, 
                            source='Name', 
                            target='Manager_name', 
                            create_using=nx.DiGraph())

#use nx.ancestors to get set of "ancenstor" nodes for each node in the directed graph
pd.DataFrame.from_dict({i:len(nx.ancestors(G,i)) for i in G.nodes()}, 
                       orient='index', 
                       columns=['Num of People reporting'])

Output:

       Num of People reporting
John                         1
Smith                        3
Mia                          0
Caleb                        0
Jason                        4

Draw newtorkx:

enter image description here

like image 34
Scott Boston Avatar answered Oct 08 '22 15:10

Scott Boston