In Python pandas I have a large data frame that looks like this:
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'b': ['foo2', 'bar2'] * 3,
                   'c': ['foo3', 'bar3'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y'],
                   'e': ['q2', 'w2', 'e2', 'r2', 't2', 'y2']})
     a     b     c  d   e
1  bar  bar2  bar3  w  w2
3  bar  bar2  bar3  r  r2
5  bar  bar2  bar3  y  y2
4  foo  foo2  foo3  t  t2
2  foo  foo2  foo3  e  e2
0  foo  foo2  foo3  q  q2
It contains a dozen columns with duplicated values (a, b, c, ...) and a few with unique values (d, e). I would like to collapse the duplicated rows and collect the unique values, i.e.:
     a     b     c  d   e
1  bar  bar2  bar3  w,r,y  w2,r2,y2
4  foo  foo2  foo3  t,e,q  t2,e2,q2
We can safely assume that the unique values appear only in 'd' and 'e', while the rest are always duplicated.
One solution I can conceive is to group by all the duplicated columns and then apply a concatenation operation on the unique values:
df.groupby([df.a, df.b, df.c]).apply(lambda x: "{%s}" % ', '.join(x.d))
One inconvenience is that I have to list all the duplicated columns if I want them in my output. A bigger problem is that I am concatenating only the strings in 'd', while 'e' is needed as well.
Any suggestions?
I think you can do something like this:
>>> df.groupby(['a', 'b', 'c']).agg(lambda col: ','.join(col))
                   d         e
a   b    c                    
bar bar2 bar3  w,r,y  w2,r2,y2
foo foo2 foo3  q,e,t  q2,e2,t2
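If you want the grouping keys back as ordinary columns rather than a MultiIndex, you can chain .reset_index() onto the aggregation. A small self-contained sketch using the same example frame:

```python
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'b': ['foo2', 'bar2'] * 3,
                   'c': ['foo3', 'bar3'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y'],
                   'e': ['q2', 'w2', 'e2', 'r2', 't2', 'y2']})

# Aggregate the unique-valued columns, then flatten the
# resulting MultiIndex back into ordinary columns
out = df.groupby(['a', 'b', 'c']).agg(lambda col: ','.join(col)).reset_index()
print(out)
```

The joined strings keep the original row order within each group, which is why 'bar' yields 'w,r,y'.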
Another way to do this, without listing all the columns, is to list only the ones with unique values:
>>> gr_columns = [x for x in df.columns if x not in ['d','e']]
>>> df.groupby(gr_columns).agg(lambda col: ','.join(col))
                   d         e
a   b    c                    
bar bar2 bar3  w,r,y  w2,r2,y2
foo foo2 foo3  q,e,t  q2,e2,t2
You could use df.pivot_table(), although it appears to be slightly slower than df.groupby() (as suggested by Roman's answer):
>>> %timeit df.pivot_table(index=['a','b','c'], values=['d','e'], aggfunc=lambda x: ','.join(x)).reset_index()
6.17 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vs
>>> %timeit df.groupby(['a', 'b', 'c']).agg(lambda col: ','.join(col)).reset_index()
4.09 ms ± 95.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Also, if you want the new columns to contain an actual list (and not a comma-separated string), you can replace the lambda ','.join(x) with list(x).
And if you want the list to include only unique elements, change it to list(set(x)).
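A short sketch of both variants (with a reasonably recent pandas, agg accepts a function that returns a list per group; sorted(set(x)) is used here instead of bare list(set(x)) because set iteration order is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'b': ['foo2', 'bar2'] * 3,
                   'c': ['foo3', 'bar3'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y'],
                   'e': ['q2', 'w2', 'e2', 'r2', 't2', 'y2']})

# Collect the values into real Python lists instead of joined strings
as_lists = df.groupby(['a', 'b', 'c']).agg(lambda x: list(x))

# Keep only distinct values per group; sorting makes the result stable
as_unique = df.groupby(['a', 'b', 'c']).agg(lambda x: sorted(set(x)))
```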