Starting from the following dataframe df:
df = pd.DataFrame({'node':[1,2,3,3,3,5,5],'lang':['it','en','ar','ar','es','uz','es']})
I'm trying to build the structure:
   node     langs   lfreq
0     1      [it]     [1]
1     2      [en]     [1]
2     3  [ar, es]  [2, 1]
3     5  [uz, es]  [1, 1]
So basically, I want to group the lang elements and their frequencies per node into a single row, using lists. What I've done so far:
# Getting the unique langs / node
a = df.groupby('node')['lang'].unique().reset_index(name='langs')
# Getting the frequency of lang / node
b = df.groupby('node')['lang'].value_counts().reset_index(name='lfreq')
c = b.groupby('node')['lfreq'].unique().reset_index(name='lfreq')
and then merge on node:
d = pd.merge(a,c,on='node')
After these operations, what I obtained is:
   node     langs   lfreq
0     1      [it]     [1]
1     2      [en]     [1]
2     3  [ar, es]  [2, 1]
3     5  [uz, es]     [1]
As you may notice, the last row has only a single [1] for the frequencies of the two langs [uz, es], instead of the expected list [1, 1]. Is there a more concise way to perform the analysis that also produces the desired output?
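I would use the agg function and tolist(). The problem is that unique() drops duplicate values, so the two frequencies of 1 for node 5 collapse into [1]; tolist() keeps every value.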
df = pd.DataFrame({'node':[1,2,3,3,3,5,5],'lang':['it','en','ar','ar','es','uz','es']})
# Getting the unique langs / node
a = df.groupby('node')['lang'].unique().reset_index(name='langs')
# Getting the frequency of lang / node
b = df.groupby('node')['lang'].value_counts().reset_index(name='lfreq')
Then replace
c = b.groupby('node')['lfreq'].unique().reset_index(name='lfreq')
with
c = b.groupby('node').agg({'lfreq': lambda x: x.tolist()}).reset_index()
d = pd.merge(a,c,on='node')
and voilà:
   node     langs   lfreq
0     1      [it]     [1]
1     2      [en]     [1]
2     3  [ar, es]  [2, 1]
3     5  [uz, es]  [1, 1]
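Since the question also asked for something more concise, here is one possible sketch that avoids the merge. It is not the approach above but a variation on it: count each (node, lang) pair once, then collect both columns into lists with named aggregation (this assumes pandas 0.25 or later). The name counts is just an illustrative intermediate. Note that the lists come out in order of first appearance rather than sorted by frequency as value_counts() would give, which happens to match for this data.

```python
import pandas as pd

df = pd.DataFrame({'node': [1, 2, 3, 3, 3, 5, 5],
                   'lang': ['it', 'en', 'ar', 'ar', 'es', 'uz', 'es']})

# Count each (node, lang) pair; sort=False keeps first-appearance order
counts = (df.groupby(['node', 'lang'], sort=False)
            .size()
            .reset_index(name='lfreq'))

# Collect the langs and their counts into parallel lists per node
d = (counts.groupby('node')
           .agg(langs=('lang', list), lfreq=('lfreq', list))
           .reset_index())

print(d)
```

Because both lists are built from the same rows of counts, each frequency stays aligned with its lang.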