pandas: Group by splitting string value in all rows (a column) and aggregation function

I have a dataset like this:

id   person_name                       salary
0    [alexander, william, smith]       45000
1    [smith, robert, gates]            65000
2    [bob, alexander]                  56000
3    [robert, william]                 80000
4    [alexander, gates]                70000

Summing the salary column gives 316000.

I want to know the total salary for each distinct name ('alexander', 'smith', etc.): split person_name into its individual names, then, for each name, sum the salaries of all rows that contain it.

Expected output:

group               sum_salary
alexander           171000 #sum from id 0 + 2 + 4 (which contain 'alexander')
william             125000 #sum from id 0 + 3
smith               110000 #sum from id 0 + 1
robert              145000 #sum from id 1 + 3
gates               135000 #sum from id 1 + 4
bob                  56000 #sum from id 2

As you can see, the total of the sum_salary column does not match the total of the original salary column, because a row's salary is counted once for every name in that row (double counting).

This looks similar to counting strings, but what confuses me is how to apply the aggregation function. I tried creating a list of the distinct values in the person_name column, but then I got stuck.

Any help is appreciated. Thank you very much.

asked Mar 12 '19 14:03

Izzan Rijal

1 Answer

Solutions that work with lists in the person_name column:

#if necessary
#df['person_name'] = df['person_name'].str.strip('[]').str.split(', ')

print (type(df.loc[0, 'person_name']))
<class 'list'>
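
For anyone who wants to reproduce this, here is a minimal sketch that builds the sample data from the question. It assumes person_name arrives as plain strings such as '[alexander, william, smith]' (the question does not say how the column is stored), and then applies the conversion from the commented line above to turn them into lists.

```python
import pandas as pd

# Sample data from the question. Assumption: person_name is stored as strings.
df = pd.DataFrame({
    'person_name': ['[alexander, william, smith]',
                    '[smith, robert, gates]',
                    '[bob, alexander]',
                    '[robert, william]',
                    '[alexander, gates]'],
    'salary': [45000, 65000, 56000, 80000, 70000],
})

# Drop the surrounding brackets, then split on ', ' to get real Python lists.
df['person_name'] = df['person_name'].str.strip('[]').str.split(', ')

print(type(df.loc[0, 'person_name']))
print(df.loc[0, 'person_name'])
```

If the column already holds lists, the conversion step is not needed and should be skipped.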

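The first idea is to use a defaultdict to store the summed values in a loop:

from collections import defaultdict
import pandas as pd

# add each row's salary to the running total of every name in that row
d = defaultdict(int)
for p, s in zip(df['person_name'], df['salary']):
    for x in p:
        d[x] += int(s)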

print (d)
defaultdict(<class 'int'>, {'alexander': 171000, 
                            'william': 125000, 
                            'smith': 110000, 
                            'robert': 145000, 
                            'gates': 135000, 
                            'bob': 56000})

Then build a DataFrame from the dictionary:

df1 = pd.DataFrame({'group':list(d.keys()),
                    'sum_salary':list(d.values())})
print (df1)
       group  sum_salary
0  alexander      171000
1    william      125000
2      smith      110000
3     robert      145000
4      gates      135000
5        bob       56000

Another solution repeats each salary by the length of its list and then aggregates with sum:

from itertools import chain

df1 = pd.DataFrame({
    'group' : list(chain.from_iterable(df['person_name'].tolist())), 
    'sum_salary' : df['salary'].values.repeat(df['person_name'].str.len())
})

df2 = df1.groupby('group', as_index=False, sort=False)['sum_salary'].sum()
print (df2)
       group  sum_salary
0  alexander      171000
1    william      125000
2      smith      110000
3     robert      145000
4      gates      135000
5        bob       56000
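
As a side note, not part of the original answer: on pandas 0.25 or newer, DataFrame.explode can do the repeating step for you. The sketch below assumes the column already holds lists and that the same column names are wanted in the result. It also prints the two totals to show the double counting the question mentions.

```python
import pandas as pd

df = pd.DataFrame({
    'person_name': [['alexander', 'william', 'smith'],
                    ['smith', 'robert', 'gates'],
                    ['bob', 'alexander'],
                    ['robert', 'william'],
                    ['alexander', 'gates']],
    'salary': [45000, 65000, 56000, 80000, 70000],
})

# explode() gives one row per name and repeats the salary of the original row.
exploded = df.explode('person_name')

# Sum per name, keeping the order in which names first appear.
result = (exploded.groupby('person_name', sort=False)['salary']
                  .sum()
                  .rename_axis('group')
                  .reset_index(name='sum_salary'))
print(result)

# Each salary is counted once per name in its row, so the grouped total
# (742000) is larger than the plain column total (316000).
print(result['sum_salary'].sum(), df['salary'].sum())
```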

answered Nov 14 '22 21:11

jezrael

Tags: python, pandas, numpy
