Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sum up rows containing exact characters in dataframe

I have a dataframe like this. I want to sum up special rows containing exact characters that matched my targets.

             Ko_EC  FPKM  count
0          1.1.1.1  16.7      1
1         1.1.1.15  30.0      7
2        4.2.1.128  40.5      9
3         4.2.1.12  57.0     10
4  3.2.1.1 1.1.1.1  22.0      4

Here are my dataframe and my targets.

# coding=utf-8
import pandas as pd
import numpy as np


#########
classes = [('1.1.1.1', 16.7, 1),
         ('1.1.1.15', 30, 7),
         ('4.2.1.128', 40.5, 9),
         ('4.2.1.12', 57, 10),
         ('3.2.1.1 1.1.1.1', 22, 4)]
labels = ['Ko_EC','FPKM', 'count']
alls = pd.DataFrame.from_records(classes, columns=labels)

target_list = ['1.1.1.1', '4.2.1.128', '4.2.1.12; 1.1.1.15; 3.2.1.1', '1.1.1.15']

I want to sum up the alls['Ko_EC'] rows containing exact characters that matched the unique target_list.

Based on the answer to a previous question, I used this code:

result = pd.DataFrame()
for target in target_list:
    mask = alls.apply(lambda x: any([cls in target for cls in x['Ko_EC'].split(' ')]),
                      axis=1)
    target_sum = alls.loc[mask,["FPKM", 'count']].sum().reset_index().rename(columns={0:target}).iloc[:,1:]
    result = pd.concat([result,target_sum], axis=1) #concat as result

result = result.T
result.columns = ['FPKM', 'Count']

It results like:

                              FPKM  Count
1.1.1.1                       38.7    5.0
4.2.1.128                     97.5   19.0
4.2.1.12; 1.1.1.15; 3.2.1.1  125.7   22.0
1.1.1.15                      68.7   12.0

What I want is

                              FPKM  Count
1.1.1.1                       38.7    5.0
4.2.1.128                     40.5    9.0
4.2.1.12; 1.1.1.15; 3.2.1.1  109.0   21.0
1.1.1.15                      30.0    7.0

Here, the 4.2.1.128 row FPKM Count only sum that of itself, and 4.2.1.12; 1.1.1.15; 3.2.1.1 rows sum that of 4.2.1.12, 1.1.1.15 and 3.2.1.1 1.1.1.1 in origin alls dataframe, but my code confused the characters 1.1.1.1/1.1.1.15 and 4.2.1.128/4.2.1.12, because of their similarity.

Could anyone tell me how to do this and why my code cannot work? Thanks!

like image 375
yan wang Avatar asked Oct 20 '25 12:10

yan wang


1 Answers

Solution

for each target in target_list, create a regex pattern from target to find the all the rows which contains the target values then sum all such rows and yield the result

def func():
    for target in target_list:
        pat = r'\b(?:%s)\b' % target.replace('; ', '|')
        mask = alls['Ko_EC'].str.contains(pat, na=False)
        yield alls[mask].sum(numeric_only=True).to_frame(target).T


result = pd.concat(func())

Result

                              FPKM  count
1.1.1.1                       38.7    5.0
4.2.1.128                     40.5    9.0
4.2.1.12; 1.1.1.15; 3.2.1.1  109.0   21.0
1.1.1.15                      30.0    7.0
like image 97
Shubham Sharma Avatar answered Oct 23 '25 02:10

Shubham Sharma



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!