I have data like the following:
NAME ETHNICITY_RECAT TOTAL_LENGTH 3LETTER_SUBSTRINGS
joseph fr 14 jos, ose, sep, eph
ann en 16 ann
anne ir 14 ann, nne
tom en 18 tom
tommy fr 16 tom, omm, mmy
ann ir 19 ann
... more rows
The 3LETTER_SUBSTRINGS values are strings that capture all the 3-letter substrings of the NAME variable. I would like to aggregate them into a single list per group, with each comma-separated item from each row appended as its own list item. As follows:
ETHNICITY_RECAT TOTAL_LENGTH 3LETTER_SUBSTRINGS
min max mean <lambda>
fr 2 26 13.22 [jos, ose, sep, eph, tom, omm, mmy, ...]
en 3 24 11.92 [ann, tom, ...]
ir 4 23 12.03 [ann, nne, ann, ...]
I kind of "did" it using the following code:
aggregations = {
'TOTAL_LENGTH': [min, max, 'mean'],
'3LETTER_SUBSTRINGS': lambda x: list(x),
}
self.df_agg = self.df.groupby('ETHNICITY_RECAT', as_index=False).agg(aggregations)
The problem is that a whole string such as "ann, nne" ends up as one single list item in the final list, instead of each substring being its own list item, such as "ann", "nne".
I would like to see the highest frequency of the substrings, but now I am getting the frequency of the whole string (instead of the individual 3-letter substring), when I run the following code:
from collections import Counter
x = self.df_agg_eth[self.df_agg_eth['ETHNICITY_RECAT']=='en']['3LETTER_SUBSTRINGS']['<lambda>']
x_list = x[0]
c = Counter(x_list)
I get this:
[('jos, ose, sep, eph', 19), ('ann, nne', 5), ...]
Instead of what I want:
[('jos', 19), ('ose', 19), ('sep', 23), ('eph', 19), ('ann', 15), ('nne', 5), ...]
I tried:
'3LETTER_SUBSTRINGS': lambda x: list(i) for i in x.split(', '),
But it says invalid syntax.
The first thing to do is convert each string into a list; after that it's just a groupby with agg:
df['3LETTER_SUBSTRINGS'] = df['3LETTER_SUBSTRINGS'].str.split(', ')
df.groupby('ETHNICITY_RECAT').agg({'TOTAL_LENGTH':['min','max','mean'],
'3LETTER_SUBSTRINGS':'sum'})
Output:
TOTAL_LENGTH 3LETTER_SUBSTRINGS
min max mean sum
ETHNICITY_RECAT
en 16 18 17.0 [ann, tom]
fr 14 16 15.0 [jos, ose, sep, eph, tom, omm, mmy]
ir 14 19 16.5 [ann, nne, ann]
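Since the end goal in the question is the frequency of each substring, here is a minimal sketch of a different route from the sum-based aggregation above: split the strings, give each substring its own row with explode, and count per group with value_counts. The sample rows are copied from the question, the variable names `exploded` and `counts` are made up for illustration, and it assumes a pandas version that has `DataFrame.explode` (0.25 or later).

```python
import pandas as pd

# Sample rows copied from the question; the real data has more rows.
df = pd.DataFrame({
    "NAME": ["joseph", "ann", "anne", "tom", "tommy", "ann"],
    "ETHNICITY_RECAT": ["fr", "en", "ir", "en", "fr", "ir"],
    "TOTAL_LENGTH": [14, 16, 14, 18, 16, 19],
    "3LETTER_SUBSTRINGS": [
        "jos, ose, sep, eph", "ann", "ann, nne", "tom", "tom, omm, mmy", "ann",
    ],
})

# Split each string into a list, then give every substring its own row.
exploded = df.copy()
exploded["3LETTER_SUBSTRINGS"] = exploded["3LETTER_SUBSTRINGS"].str.split(", ")
exploded = exploded.explode("3LETTER_SUBSTRINGS")

# Count how often each substring occurs within each ethnicity group.
counts = exploded.groupby("ETHNICITY_RECAT")["3LETTER_SUBSTRINGS"].value_counts()
print(counts)
```

The result is a Series indexed by (ETHNICITY_RECAT, substring), so a single count can be looked up with `counts.loc[("ir", "ann")]`.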
I think most of your code is fine; you just misinterpreted the error, which has nothing to do with string conversion. You have lists/tuples in each cell of the 3LETTER_SUBSTRING column, so when you use the lambda x: list(x) function you create a list of tuples. Hence there is no need for split(",") or for casting to a string and back to a table.
Instead, you just need to unnest your table when you create your new list. Here is a small reproducible example (note that I focused on your tuple/aggregation issue, as I'm sure you will quickly work out the rest of the code):
import pandas as pd
# Create some data
names = [("joseph","fr"),("ann","en"),("anne","ir"),("tom","en"),("tommy","fr"),("ann","fr")]
df = pd.DataFrame(names, columns=["NAMES","ethnicity"])
df["3LETTER_SUBSTRING"] = df["NAMES"].apply(lambda name: [name[i:i+3] for i in range(len(name) - 2)])
print(df)
# Aggregate the 3LETTER per ethnicity, and unnest the result in a new table for each ethnicity:
df.groupby('ethnicity').agg({
"3LETTER_SUBSTRING": lambda x:[z for y in x for z in y]
})
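The lambda above is a nested comprehension that flattens a sequence of lists into one list. As a standalone sketch of just that step (the `cells` list is hypothetical, standing in for the cells of one group), it can be compared with `itertools.chain.from_iterable`, which should give the same result:

```python
from itertools import chain

# Hypothetical per-row lists, standing in for the cells of one group.
cells = [["ann", "nne"], ["ann"], ["tom", "omm", "mmy"]]

# Nested comprehension, as used in the lambda: outer loop over rows,
# inner loop over the substrings in each row.
flat_comprehension = [z for y in cells for z in y]

# Equivalent flattening with the standard library.
flat_chain = list(chain.from_iterable(cells))

print(flat_comprehension)
print(flat_chain)
```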
Using the Counter you specified, I got:
dfg = df.groupby('ethnicity', as_index=False).agg({
"3LETTER_SUBSTRING": lambda x:[z for y in x for z in y]
})
from collections import Counter
print(Counter(dfg[dfg["ethnicity"] == "en"]["3LETTER_SUBSTRING"][0]))
# Counter({'ann': 1, 'tom': 1})
To get it as a list of tuples, just use a built-in dictionary method such as dict.items().
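As a small sketch of that last step: Counter is a dict subclass, so items() works on it, and Counter also has most_common(), which returns the pairs sorted by descending count and may be closer to the "highest frequency" goal. The `substrings` list here is made up for illustration.

```python
from collections import Counter

# Hypothetical flattened list, standing in for one group's aggregated substrings.
substrings = ["ann", "nne", "ann", "tom"]

c = Counter(substrings)

# items() yields (substring, count) pairs; wrap in list() to get a list of tuples.
pairs = list(c.items())

# most_common() returns the same pairs sorted by descending count.
ranked = c.most_common()

print(pairs)
print(ranked)
```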
UPDATE: using a preformatted string list, as in the question:
import pandas as pd
# Create some data
names = [("joseph","fr","jos, ose, sep, eph"),("ann","en","ann"),("anne","ir","ann, nne"),("tom","en","tom"),("tommy","fr","tom, omm, mmy"),("ann","fr","ann")]
df = pd.DataFrame(names, columns=["NAMES","ethnicity","3LETTER_SUBSTRING"])
def transform_3_letter_to_table(x):
    """
    Update this function with regard to your data format
    """
    return x.split(", ")
df["3LETTER_SUBSTRING"] = df["3LETTER_SUBSTRING"].apply(transform_3_letter_to_table)
print(df)
# Applying aggregation
dfg = df.groupby('ethnicity', as_index=False).agg({
"3LETTER_SUBSTRING": lambda x:[z for y in x for z in y]
})
print(dfg)
# test on some data
from collections import Counter
c = Counter(dfg[dfg["ethnicity"] == "en"]["3LETTER_SUBSTRING"][0])
print(c)
print(list(c.items()))