Starting from the following dataframe df:
import pandas as pd
df = pd.DataFrame({'node':[1,2,3,3,3,5,5],'lang':['it','en','ar','ar','es','uz','es']})
I'm trying to build the structure:
    node     langs   lfreq
0      1      [it]     [1]
1      2      [en]     [1]
2      3  [ar, es]  [2, 1]
3      5  [uz, es]  [1, 1]
that is, collecting the lang values and their frequencies per node into lists on a single row. What I've done so far:
# Getting the unique langs / node
a = df.groupby('node')['lang'].unique().reset_index(name='langs')
# Getting the frequency of lang / node
b = df.groupby('node')['lang'].value_counts().reset_index(name='lfreq')
c = b.groupby('node')['lfreq'].unique().reset_index(name='lfreq')
and then merge on node:
d = pd.merge(a,c,on='node')
After these operations, what I obtain is:
    node     langs   lfreq
0      1      [it]     [1]
1      2      [en]     [1]
2      3  [ar, es]  [2, 1]
3      5  [uz, es]     [1]
As you may notice, the last row has only a single [1] for the frequencies of the two langs [uz, es], instead of the expected list [1, 1]. Is there a more concise way to do this that produces the desired output?
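I would use the agg function with tolist() instead of unique(). unique() drops repeated values, so the two counts of 1 for node 5 collapse into a single [1]; tolist() keeps every count: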
df = pd.DataFrame({'node':[1,2,3,3,3,5,5],'lang':['it','en','ar','ar','es','uz','es']})
# Getting the unique langs / node
a = df.groupby('node')['lang'].unique().reset_index(name='langs')
# Getting the frequency of lang / node
b = df.groupby('node')['lang'].value_counts().reset_index(name='lfreq')
Then replace
c = b.groupby('node')['lfreq'].unique().reset_index(name='lfreq')
with
c = b.groupby('node').agg({'lfreq': lambda x: x.tolist()}).reset_index()
d = pd.merge(a,c,on='node')
and voilà:
   node     langs   lfreq
0     1      [it]     [1]
1     2      [en]     [1]
2     3  [ar, es]  [2, 1]
3     5  [uz, es]  [1, 1]
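Since the question also asks for something more concise, here is a minimal sketch that avoids the merge altogether. It assumes a pandas version with named aggregation (0.25 or later), and it uses groupby(...).size() with sort=False in place of value_counts() so that the langs keep their order of first appearance. The intermediate name counts is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'node': [1, 2, 3, 3, 3, 5, 5],
                   'lang': ['it', 'en', 'ar', 'ar', 'es', 'uz', 'es']})

# Count each (node, lang) pair; sort=False keeps pairs in order of first appearance
counts = df.groupby(['node', 'lang'], sort=False).size().reset_index(name='lfreq')

# Collect langs and their counts into parallel lists, one row per node
d = (counts.groupby('node')
           .agg(langs=('lang', list), lfreq=('lfreq', list))
           .reset_index())

print(d)
```

Because both lists are built from the same rows of counts, each entry in lfreq lines up with the lang at the same position in langs, and repeated counts such as [1, 1] are preserved.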