
Python Pandas: merge same-name columns in a dataframe

Tags:

python

pandas

So I have a few CSV files I'm trying to work with, but some of them have multiple columns with the same name.

For example, I could have a CSV like this:

ID   Name   a    a    a     b    b
1    test1  1    NaN  NaN   "a"  NaN
2    test2  NaN  2    NaN   "a"  NaN
3    test3  2    3    NaN   NaN  "b"
4    test4  NaN  NaN  4     NaN  "b"

Loading it into pandas gives me this:

ID   Name   a    a.1  a.2   b    b.1
1    test1  1    NaN  NaN   "a"  NaN
2    test2  NaN  2    NaN   "a"  NaN
3    test3  2    3    NaN   NaN  "b"
4    test4  NaN  NaN  4     NaN  "b"

What I would like to do is merge those same-name columns into one column (keeping the values separate when there is more than one). My ideal output would be this:

ID   Name   a      b  
1    test1  "1"    "a"   
2    test2  "2"    "a"
3    test3  "2;3"  "b"
4    test4  "4"    "b"

Is this possible?

Wizuriel asked Jun 24 '14 15:06


2 Answers

You could use groupby on axis=1, and experiment with something like

>>> def sjoin(x): return ';'.join(x[x.notnull()].astype(str))
>>> df.groupby(level=0, axis=1).apply(lambda x: x.apply(sjoin, axis=1))
  ID   Name        a  b
0  1  test1      1.0  a
1  2  test2      2.0  a
2  3  test3  2.0;3.0  b
3  4  test4      4.0  b

where instead of using .astype(str), you could use whatever formatting operator you wanted.
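
Note that grouping by column label only merges columns whose labels are identical, so with the a, a.1, a.2 names that read_csv produces, the numeric suffixes would need stripping first. Here is a minimal sketch of that, under a few assumptions: the helper name merge_same_name_columns and the sample CSV_TEXT are made up for illustration, everything is read as strings (dtype=str) so the values come out as "2;3" rather than "2.0;3.0", and no real column name ends in a ".N" suffix. It loops over the column groups instead of calling groupby with axis=1, which recent pandas versions have deprecated as far as I know.

```python
import io
import re

import pandas as pd

# Hypothetical sample data mirroring the CSV in the question.
CSV_TEXT = """ID,Name,a,a,a,b,b
1,test1,1,,,a,
2,test2,,2,,a,
3,test3,2,3,,,b
4,test4,,,4,,b
"""


def merge_same_name_columns(df, sep=";"):
    """Join the non-null values of columns that share a base name."""
    # Strip the ".1", ".2", ... suffix that read_csv appends to repeated headers.
    base_names = [re.sub(r"\.\d+$", "", str(col)) for col in df.columns]

    merged = {}
    for name in dict.fromkeys(base_names):  # unique names, original order
        positions = [i for i, base in enumerate(base_names) if base == name]
        block = df.iloc[:, positions]
        # Row by row, keep only the non-null values and join them.
        merged[name] = block.apply(
            lambda row: sep.join(str(v) for v in row if pd.notna(v)), axis=1
        )
    return pd.DataFrame(merged, index=df.index)


# dtype=str keeps "1" as "1" instead of converting it to 1.0.
df = pd.read_csv(io.StringIO(CSV_TEXT), dtype=str)
result = merge_same_name_columns(df)
print(result)
```

If the columns should stay numeric, drop dtype=str and swap str(v) for whatever formatting you prefer.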

DSM answered Nov 14 '22 21:11


It is probably not a good idea to have duplicate column names, but this will work:

In [72]:

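# Assumes df really has duplicate column labels 'a' and 'b'
df2 = df[['ID', 'Name']].copy()  # copy so the assignments below don't hit a view of df
# The 'a' columns are float dtype: format the values as ints and wrap them in quotes
df2['a'] = '"' + df.T[df.columns.values == 'a'].apply(lambda x: ';'.join(["%i" % item for item in x[x.notnull()]])) + '"'
# The 'b' columns are object dtype (strings), so they can be joined directly
df2['b'] = df.T[df.columns.values == 'b'].apply(lambda x: ';'.join([item for item in x[x.notnull()]]))
print(df2)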
   ID   Name      a    b
0   1  test1    "1"  "a"
1   2  test2    "2"  "a"
2   3  test3  "2;3"  "b"
3   4  test4    "4"  "b"

[4 rows x 4 columns]
CT Zhu answered Nov 14 '22 23:11