
How to select and delete columns with duplicate name in pandas DataFrame

I have a huge DataFrame in which some columns share the same name. When I try to select or delete a column whose name occurs twice (e.g. del df['col name'] or df2 = df['col name']), I get an error. What can I do?

user3107640 asked Dec 16 '13 14:12



2 Answers

You can address columns by position instead of by name:

>>> import pandas as pd
>>> df = pd.DataFrame([[1,2],[3,4],[5,6]], columns=['a','a'])
>>> df
   a  a
0  1  2
1  3  4
2  5  6
>>> df.iloc[:,0]
0    1
1    3
2    5
Name: a, dtype: int64
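
The question also asks about deleting. A minimal sketch of two positional ways to drop a duplicate-named column, assuming the same two-column frame as above (the names keep, without_second and first_only are only illustrative):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["a", "a"])

# Drop the column at position 1 by keeping every other position.
keep = [i for i in range(df.shape[1]) if i != 1]
without_second = df.iloc[:, keep]

# Alternatively, keep only the first occurrence of each column name.
# Index.duplicated() marks the second and later occurrences as True.
first_only = df.loc[:, ~df.columns.duplicated()]
```

The first form is useful when you know exactly which position to remove; the second silently discards every later duplicate, so it only fits if those columns are really redundant.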

Or you can rename the columns so that every name is unique:

>>> df.columns = ['a','b']
>>> df
   a  b
0  1  2
1  3  4
2  5  6
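
With many columns, typing the new names by hand is impractical. One possible approach is to generate unique names by appending a counter to repeats. This is only a sketch: dedupe_names is a hypothetical helper, not a pandas function, and it does not guard against a generated name such as a_1 colliding with a column that already has that name.

```python
import pandas as pd

def dedupe_names(names):
    """Return names with _1, _2, ... appended to repeated entries (hypothetical helper)."""
    seen = {}
    result = []
    for name in names:
        count = seen.get(name, 0)
        # First occurrence keeps its name; later ones get a numeric suffix.
        result.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    return result

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "a", "b"])
df.columns = dedupe_names(df.columns)
```

After this, df['a'] and df['a_1'] each refer to a single column, so selection and del behave normally.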
Roman Pekar answered Sep 20 '22 15:09


This is not a good situation to be in. The best fix is a hierarchical column labeling scheme (pandas supports multi-level labels for both columns and the row index). Work out what actually distinguishes the two columns that share a name, and use that distinction as an extra level in a hierarchical column index.
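
As a rough illustration of the hierarchical idea, assuming the two 'a' columns come from two different sources (the level values "left" and "right" and the level names are made up for this example):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["a", "a"])

# Add an outer level that says where each 'a' column came from.
df.columns = pd.MultiIndex.from_tuples(
    [("left", "a"), ("right", "a")],
    names=["source", "name"],
)

# Each column now has a unique (source, name) label.
right_a = df[("right", "a")]   # a Series
all_left = df["left"]          # a DataFrame with the inner level as columns
```

Because every full label is unique, selecting by label no longer returns several columns at once.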

In the meantime, if you know the position of the columns in the ordered list of columns (e.g. from dataframe.columns), you can use positional indexing with .iloc[] to retrieve a column's values. The .ix[] indexer served the same purpose in older pandas versions, but it was deprecated in 0.20 and removed in 1.0.
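
If the positions are not known in advance, they can be read off dataframe.columns. A small sketch, assuming a three-column frame invented for the example:

```python
import pandas as pd

dataframe = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

# Positions of every column whose label equals "a".
positions = [i for i, name in enumerate(dataframe.columns) if name == "a"]

# Retrieve the second occurrence of "a" as a Series, by position.
second_a = dataframe.iloc[:, positions[1]]
```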

You can also create copies of the columns with new names, such as:

dataframe["new_name"] = dataframe.iloc[:, column_position].values

where column_position is the position of the column you are trying to get (not its name).

These approaches may not suit very large data, however, because copying columns uses additional memory. The best long-term option is to change how the DataFrame is constructed so that it gets a hierarchical column index from the start.

ely answered Sep 17 '22 15:09