If i have dataset like this:
id   person_name                       salary
0    [alexander, william, smith]       45000
1    [smith, robert, gates]            65000
2    [bob, alexander]                  56000
3    [robert, william]                 80000
4    [alexander, gates]                70000
If we sum that salary column then we will get 316000
I really want to know how much person who named 'alexander, smith, etc' (in distinct) makes in salary if we sum all of the salaries from its splitting name in this dataset (that contains same string value).
output:
group               sum_salary
alexander           171000 #sum from id 0 + 2 + 4 (which contain 'alexander')
william             125000 #sum from id 0 + 3
smith               110000 #sum from id 0 + 1
robert              145000 #sum from id 1 + 3
gates               135000 #sum from id 1 + 4
bob                  56000 #sum from id 2
as we see the sum of sum_salary columns is not the same as the initial dataset. all because the function requires double counting.
I thought it seems familiar like string count, but what makes me confuse is the way we use aggregation function. I've tried creating a new list of distinct value in person_name columns, then stuck comes.
Any help is appreciated, Thank you very much
Step 1: split the data into groups by creating a groupby object from the original DataFrame; Step 2: apply a function, in this case, an aggregation function that computes a summary statistic (you can also transform or filter your data in this step); Step 3: combine the results into a new DataFrame.
Groupby is a very powerful pandas method. You can group by one column and count the values of another column per this column value using value_counts. Using groupby and value_counts we can count the number of activities each person did.
Split column by delimiter into multiple columns Apply the pandas series str. split() function on the “Address” column and pass the delimiter (comma in this case) on which you want to split the column. Also, make sure to pass True to the expand parameter.
To split text in a column into multiple rows with Python Pandas, we can use the str. split method. to create the df data frame. Then we call str.
Solutions working with lists in column person_name:
#if necessary
#df['person_name'] = df['person_name'].str.strip('[]').str.split(', ')
print (type(df.loc[0, 'person_name']))
<class 'list'>
First idea is use defaultdict for store sumed values in loop:
from collections import defaultdict
d = defaultdict(int)
for p, s in zip(df['person_name'], df['salary']):
    for x in p:
        d[x] += int(s)
print (d)
defaultdict(<class 'int'>, {'alexander': 171000, 
                            'william': 125000, 
                            'smith': 110000, 
                            'robert': 145000, 
                            'gates': 135000, 
                            'bob': 56000})
And then:
df1 = pd.DataFrame({'group':list(d.keys()),
                    'sum_salary':list(d.values())})
print (df1)
       group  sum_salary
0  alexander      171000
1    william      125000
2      smith      110000
3     robert      145000
4      gates      135000
5        bob       56000
Another solution with repeating values by length of lists and aggregate sum:
from itertools import chain
df1 = pd.DataFrame({
    'group' : list(chain.from_iterable(df['person_name'].tolist())), 
    'sum_salary' : df['salary'].values.repeat(df['person_name'].str.len())
})
df2 = df1.groupby('group', as_index=False, sort=False)['sum_salary'].sum()
print (df2)
       group  sum_salary
0  alexander      171000
1    william      125000
2      smith      110000
3     robert      145000
4      gates      135000
5        bob       56000
                        If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With