
How to aggregate string with comma-separated items of a column into a list with Pandas groupby()?

I have data like the following:

NAME    ETHNICITY_RECAT TOTAL_LENGTH    3LETTER_SUBSTRINGS
joseph  fr              14              jos, ose, sep, eph
ann     en              16              ann
anne    ir              14              ann, nne
tom     en              18              tom
tommy   fr              16              tom, omm, mmy
ann     ir              19              ann
... more rows

The 3LETTER_SUBSTRINGS values are strings that capture all the 3-letter substrings of the NAME variable. I would like to aggregate them into a single list per group, with each comma-separated item from each row appended as its own list item, as follows:

ETHNICITY_RECAT TOTAL_LENGTH            3LETTER_SUBSTRINGS
                min max mean            <lambda>
fr              2   26  13.22           [jos, ose, sep, eph, tom, omm, mmy, ...]
en              3   24  11.92           [ann, tom, ...]
ir              4   23  12.03           [ann, nne, ann, ...]

I kind of "did" it using the following code:

aggregations = {
    'TOTAL_LENGTH': [min, max, 'mean'], 
    '3LETTER_SUBSTRINGS': lambda x: list(x),
    }

self.df_agg = self.df.groupby('ETHNICITY_RECAT', as_index=False).agg(aggregations)

The problem is that the whole string "ann, nne" is treated as a single list item in the final list, instead of each substring being its own item, such as "ann" and "nne".

I would like to find the most frequent substrings, but I currently get the frequency of the whole string (instead of each individual 3-letter substring) when I run the following code:

from collections import Counter 
x = self.df_agg_eth[self.df_agg_eth['ETHNICITY_RECAT']=='en']['3LETTER_SUBSTRINGS']['<lambda>']
x_list = x[0]
c = Counter(x_list)

I get this:

[('jos, ose, sep, eph', 19), ('ann, nne', 5), ...]

Instead of what I want:

[('jos', 19), ('ose', 19), ('sep', 23), ('eph', 19), ('ann', 15), ('nne', 5), ...]

I tried:

'3LETTER_SUBSTRINGS': lambda x: list(i) for i in x.split(', '),

But it says invalid syntax.

KubiK888 asked Oct 15 '22 09:10



2 Answers

The first thing to do is convert each string into a list; after that it's just a groupby with agg:

df['3LETTER_SUBSTRINGS'] = df['3LETTER_SUBSTRINGS'].str.split(', ')

df.groupby('ETHNICITY_RECAT').agg({'TOTAL_LENGTH':['min','max','mean'],
                                   '3LETTER_SUBSTRINGS':'sum'})

Output:

                TOTAL_LENGTH                             3LETTER_SUBSTRINGS
                         min max  mean                                  sum
ETHNICITY_RECAT                                                            
en                        16  18  17.0                           [ann, tom]
fr                        14  16  15.0  [jos, ose, sep, eph, tom, omm, mmy]
ir                        14  19  16.5                      [ann, nne, ann]
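
If the end goal is only the substring frequencies per group, a possible alternative is to skip building the lists and count directly. This is a minimal sketch, assuming pandas 0.25 or later (for `explode`) and a small made-up sample that mirrors the question's columns; the names `exploded` and `counts` are just for illustration:

```python
import pandas as pd

# Hypothetical sample mirroring the question's columns.
df = pd.DataFrame({
    "ETHNICITY_RECAT": ["fr", "en", "ir", "en", "fr", "ir"],
    "3LETTER_SUBSTRINGS": ["jos, ose, sep, eph", "ann", "ann, nne",
                           "tom", "tom, omm, mmy", "ann"],
})

# Turn each comma-separated string into a list of substrings.
df["3LETTER_SUBSTRINGS"] = df["3LETTER_SUBSTRINGS"].str.split(", ")

# Give every substring its own row (the group label is repeated).
exploded = df.explode("3LETTER_SUBSTRINGS")

# Count how often each substring occurs within each group.
counts = exploded.groupby("ETHNICITY_RECAT")["3LETTER_SUBSTRINGS"].value_counts()
print(counts)
```

The result is a Series indexed by (group, substring), so a single count can be looked up with `counts.loc[("ir", "ann")]`.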
Quang Hoang answered Oct 21 '22 04:10



I think most of your code is fine; you just misinterpreted the error, which has nothing to do with string conversion. Each cell of the 3LETTER_SUBSTRING column holds a list (or tuple). When you apply lambda x: list(x), you get a list of those lists, so there is no need for split(",") or for casting to a string and back.

Instead, you just need to flatten the nested lists when you build the aggregated list. Here is a small reproducible example (note that I focused on your list/aggregation issue, as I'm sure you will quickly work out the rest of the code):

import pandas as pd
# Create some data
names = [("joseph","fr"),("ann","en"),("anne","ir"),("tom","en"),("tommy","fr"),("ann","fr")]
df = pd.DataFrame(names, columns=["NAMES","ethnicity"])
df["3LETTER_SUBSTRING"] = df["NAMES"].apply(lambda name: [name[i:i+3] for i in range(len(name) - 2)])
print(df)
# Aggregate the 3LETTER per ethnicity, and unnest the result in a new table for each ethnicity:
df.groupby('ethnicity').agg({
    "3LETTER_SUBSTRING": lambda x:[z for y in x for z in y]
})
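
To see what the nested comprehension in that lambda does on its own, here is a small sketch on plain Python lists, with no pandas involved. The `rows` data is made up for illustration, and `itertools.chain.from_iterable` is shown only as an equivalent standard-library alternative, not something the code above uses:

```python
from itertools import chain

# Hypothetical per-row lists, as they would arrive in one group.
rows = [["jos", "ose", "sep", "eph"], ["tom", "omm", "mmy"], ["ann"]]

# Same pattern as the lambda: walk each row (y), then each item (z) in it.
flat_comprehension = [z for y in rows for z in y]

# Equivalent flattening with the standard library.
flat_chain = list(chain.from_iterable(rows))

print(flat_comprehension)
print(flat_comprehension == flat_chain)
```

Both forms flatten one level only, which is all that is needed when each cell holds a flat list of substrings.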

Using the Counter from your question, I get:

dfg = df.groupby('ethnicity', as_index=False).agg({
    "3LETTER_SUBSTRING": lambda x:[z for y in x for z in y]
})
from collections import Counter
print(Counter(dfg[dfg["ethnicity"] == "en"]["3LETTER_SUBSTRING"][0]))
# Counter({'ann': 1, 'tom': 1})

To get it as a list of tuples, call .items() on the Counter and wrap the result in list().
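
Since the question asks for the highest frequencies, `Counter.most_common()` may be more convenient than `.items()`, because it returns the pairs already sorted by count. A minimal sketch on a made-up flattened list (the `substrings` data is only an illustration):

```python
from collections import Counter

# Hypothetical flattened list for one group.
substrings = ["ann", "tom", "ann", "nne"]
c = Counter(substrings)

pairs = list(c.items())   # (substring, count) pairs in first-seen order
ranked = c.most_common()  # same pairs, sorted by count, highest first
top = c.most_common(1)    # only the single most frequent pair

print(pairs)
print(ranked)
print(top)
```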


UPDATE: using a preformatted comma-separated string, as in the question:

import pandas as pd
# Create some data
names = [("joseph","fr","jos, ose, sep, eph"),("ann","en","ann"),("anne","ir","ann, nne"),("tom","en","tom"),("tommy","fr","tom, omm, mmy"),("ann","fr","ann")]
df = pd.DataFrame(names, columns=["NAMES","ethnicity","3LETTER_SUBSTRING"])
def transform_3_letter_to_table(x):
    """
    Update this function with regard to your data format
    """
    return x.split(", ")
df["3LETTER_SUBSTRING"] = df["3LETTER_SUBSTRING"].apply(transform_3_letter_to_table)
print(df)
# Applying aggregation
dfg = df.groupby('ethnicity', as_index=False).agg({
    "3LETTER_SUBSTRING": lambda x:[z for y in x for z in y]
})
print(dfg)
# test on some data
from collections import Counter
c = Counter(dfg[dfg["ethnicity"] == "en"]["3LETTER_SUBSTRING"][0])
print(c)
print(list(c.items()))
Théophile Pace answered Oct 21 '22 04:10
