Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to calculate mean over comma separated column with Pandas?

Tags:

Let's consider the following CSV file test.csv:

"x","y","A","B"
8000000000,"0,1","0.113948,0.113689",0.114042
8000000000,"0,1","0.114063,0.113823",0.114175
8000000000,"0,1","0.114405,0.114366",0.114524
8000000000,"0,1,2,3","0.167543,0.172369,0.419197,0.427285",0.427576
8000000000,"0,1,2,3","0.167784,0.172145,0.418624,0.426492",0.428736
8000000000,"0,1,2,3","0.168121,0.172729,0.419768,0.427467",0.428578

My goal is to group the lines by the columns "x" and "y", and compute the arithmetic mean over the columns "A" and "B".

My first approach was to use a combination of groupby() and mean() in Pandas:

import pandas

if __name__ == "__main__":
    data = pandas.read_csv("test.csv", header=0)
    data = data.groupby(["x", "y"], as_index=False).mean()
    print(data)

Running this script yields the following output:

            x        y         B
0  8000000000      0,1  0.114247
1  8000000000  0,1,2,3  0.428297

As we can see, achieving my goal for the single-valued column "B" is straightforward. However, the column "A" is omitted. Instead, I'd like to have the column "A" with with a string containing the arithmetic mean for each comma-separated value. The desired output should look like this:

            x        y                                    A         B
0  8000000000      0,1                    0.114139,0.113959  0.114247
1  8000000000  0,1,2,3  0.167816,0.172414,0.419196,0.427081  0.428297

Does anybody know how to do this?

like image 420
epic-skyrise-tm Avatar asked Oct 01 '20 17:10

epic-skyrise-tm


1 Answers

You can make a custom aggregate function that parses those strings into lists, finds the averages per column, and formats them back as strings:

def string_mean(rows):
    data_list = []
    for row in rows:
        data_list.append([float(item) for item in row.split(",")])
    data = np.array(data_list)
    return ",".join([f"{item:.6f}" for item in data.mean(axis=0)])
    
df.groupby(["x", "y"], as_index=False).agg({"A": string_mean, "B": "mean"})

which returns

            x        y                                    A         B
0  8000000000      0,1                    0.114139,0.113959  0.114247
1  8000000000  0,1,2,3  0.167816,0.172414,0.419196,0.427081  0.428297

Note it will error if your strings in A have a different number of columns in a single group.

You can probably clean up my function above considerably btw

like image 134
Dan Avatar answered Sep 30 '22 19:09

Dan