In Python pandas I have a large data frame that looks like this:
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'b': ['foo2', 'bar2'] * 3,
                   'c': ['foo3', 'bar3'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y'],
                   'e': ['q2', 'w2', 'e2', 'r2', 't2', 'y2']})
     a     b     c  d   e
1  bar  bar2  bar3  w  w2
3  bar  bar2  bar3  r  r2
5  bar  bar2  bar3  y  y2
4  foo  foo2  foo3  t  t2
2  foo  foo2  foo3  e  e2
0  foo  foo2  foo3  q  q2
It contains a dozen columns with duplicated values (a, b, c, ...) and a few with unique values (d, e). I would like to collapse the duplicated rows and collect the unique values, i.e.:
     a     b     c  d   e
1  bar  bar2  bar3  w,r,y  w2,r2,y2
4  foo  foo2  foo3  t,e,q  t2,e2,q2
We can safely assume that the unique values appear only in 'd' and 'e', while the rest are always duplicated.
One solution I can conceive is to group by all the duplicated columns and then apply a concatenation operation on the unique values:
df.groupby([df.a, df.b, df.c]).apply(lambda x: "{%s}" % ', '.join(x.d))
One inconvenience is that I have to list all the duplicated columns if I want them in my output. A bigger problem is that I am concatenating only the strings in 'd', while 'e' is needed as well.
Any suggestions?
I think you can do something like this:
>>> df.groupby(['a', 'b', 'c']).agg(lambda col: ','.join(col))
                   d         e
a   b    c                    
bar bar2 bar3  w,r,y  w2,r2,y2
foo foo2 foo3  q,e,t  q2,e2,t2
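If you want the grouping keys back as ordinary columns rather than a MultiIndex, you can chain .reset_index() onto the aggregation. A small self-contained sketch using the same example frame:

```python
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'b': ['foo2', 'bar2'] * 3,
                   'c': ['foo3', 'bar3'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y'],
                   'e': ['q2', 'w2', 'e2', 'r2', 't2', 'y2']})

# Aggregate the unique-valued columns, then flatten the
# resulting MultiIndex back into ordinary columns
out = df.groupby(['a', 'b', 'c']).agg(lambda col: ','.join(col)).reset_index()
print(out)
```

The joined strings keep the original row order within each group, which is why 'bar' yields 'w,r,y'.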
Another way to do this, without listing all the columns, is to list only the ones with unique values:
>>> gr_columns = [x for x in df.columns if x not in ['d','e']]
>>> df.groupby(gr_columns).agg(lambda col: ','.join(col))
                   d         e
a   b    c                    
bar bar2 bar3  w,r,y  w2,r2,y2
foo foo2 foo3  q,e,t  q2,e2,t2
You could use df.pivot_table(), although it appears to be slightly slower than df.groupby() (as suggested by Roman's answer):
>>> %timeit df.pivot_table(index=['a','b','c'], values=['d','e'], aggfunc=lambda x: ','.join(x)).reset_index()
6.17 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vs
>>> %timeit df.groupby(['a', 'b', 'c']).agg(lambda col: ','.join(col)).reset_index()
4.09 ms ± 95.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Also, if you want the new columns to contain an actual list (and not a comma-separated string), you can replace the lambda ','.join(x) with list(x).
And if you want the list to include only unique elements, change it to list(set(x)).
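A short sketch of both variants (with a reasonably recent pandas, agg accepts a function that returns a list per group; sorted(set(x)) is used here instead of bare list(set(x)) because set iteration order is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'b': ['foo2', 'bar2'] * 3,
                   'c': ['foo3', 'bar3'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y'],
                   'e': ['q2', 'w2', 'e2', 'r2', 't2', 'y2']})

# Collect the values into real Python lists instead of joined strings
as_lists = df.groupby(['a', 'b', 'c']).agg(lambda x: list(x))

# Keep only distinct values per group; sorting makes the result stable
as_unique = df.groupby(['a', 'b', 'c']).agg(lambda x: sorted(set(x)))
```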