
pandas: pd.concat results in duplicated columns

Tags: python, pandas

I have a number of large dataframes in a list. I concatenate all of them to produce a single large dataframe.

df_list # This contains a list of dataframes
result = pd.concat(df_list, axis=0)
result.columns.duplicated().any() # This returns True

My expectation was that pd.concat would not produce duplicate columns.

I want to understand when it could result in duplicate columns so that I can debug the source.

I could not reproduce the problem with a toy dataset.

I have verified that the input data frames have unique columns by running df.columns.duplicated().any().

The pandas version used is 1.0.1.

(Pdb) p result_data[0].columns.duplicated().any()
False
(Pdb) p result_data[1].columns.duplicated().any()
False
(Pdb) p result_data[2].columns.duplicated().any()
False
(Pdb) p result_data[3].columns.duplicated().any()
False
(Pdb) p pd.concat(result_data[0:4]).columns.duplicated().any()
True
Suresh asked Apr 30 '20 02:04



1 Answer

Consider the following behaviour:

In [452]: df1 = pd.DataFrame({'A':[1,2,3], 'B':[2,3,4]})                                                                                                                                                    

In [468]: df2 = pd.DataFrame({'A':[1,2,3], 'B':[2,4,5]})

In [460]: df_list = [df1,df2]

Concatenating along axis=1 keeps the duplicate columns:

In [463]: pd.concat(df_list, axis=1)                                                                                                                                                                        
Out[474]: 
   A  B  A  B
0  1  2  1  2
1  2  3  2  4
2  3  4  3  5

With axis=1, pd.concat places the dataframes side by side exactly as they are. It does not drop or rename duplicate column labels.
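If the repeated labels are not wanted after an axis=1 concat, one common way to deal with them is to filter on the same columns.duplicated() mask used in the question, or to pass keys= so that each block gets its own top-level label. A minimal sketch, reusing df1 and df2 from above (the names wide, deduped and keyed are only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 4, 5]})

# Side-by-side concat: the labels A and B each appear twice.
wide = pd.concat([df1, df2], axis=1)

# Boolean mask that is True for the second and later occurrence of a label.
dup_mask = wide.columns.duplicated()

# Keep only the first occurrence of every column label.
deduped = wide.loc[:, ~dup_mask]

# Alternative: tag each input with a key so the resulting
# (key, column) pairs are distinct in a MultiIndex.
keyed = pd.concat([df1, df2], axis=1, keys=['df1', 'df2'])
```

Note that the mask approach silently discards the later columns (here, the B values from df2), so it only makes sense when those columns really are redundant.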

If you omit axis (the default is axis=0), pd.concat stacks one dataframe below another and aligns the columns by name.

So here you can end up with duplicate rows, but not duplicate columns.

In [477]: pd.concat(df_list)                                                                                                                                                                                
Out[477]: 
   A  B
0  1  2  ## duplicate row
1  2  3
2  3  4
0  1  2  ## duplicate row
1  2  4
2  3  5

You can remove these duplicate rows by using drop_duplicates():

In [478]: pd.concat(df_list).drop_duplicates()                                                                                                                                                              
Out[478]: 
   A  B
0  1  2
1  2  3
2  3  4
1  2  4
2  3  5
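To illustrate what "in the same columns" means for the default axis=0 case, here is a small sketch with frames whose columns only partly overlap. The frame df3 is hypothetical and not part of the data above. Columns are matched by label, and a label missing from one frame is filled with NaN rather than duplicated:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df3 = pd.DataFrame({'B': [5, 6], 'C': [7, 8]})  # hypothetical frame, shares only B

# Stack vertically; ignore_index=True renumbers the rows 0..n-1
# instead of repeating each frame's own index labels.
stacked = pd.concat([df1, df3], ignore_index=True, sort=False)

# The shared label B appears once, with values taken from both frames.
shared_values = stacked['B'].tolist()

# A exists only in df1, so the rows that came from df3 hold NaN there.
missing_in_a = int(stacked['A'].isna().sum())
```

With toy inputs like these, the result has one column per distinct label, which is why the duplicate-column check comes back False.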

Update after OP's comment:

In [507]: df_list[0].columns.duplicated().any()                                                                                                                                                             
Out[507]: False

In [508]: df_list[1].columns.duplicated().any()                                                                                                                                                             
Out[508]: False

In [510]: pd.concat(df_list[0:2]).columns.duplicated().any()                                                                                                                                                
Out[510]: False
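Since the toy data does not reproduce the issue, it may help to narrow it down on the real data. The helpers below are a rough sketch, not a fix: duplicated_labels and first_offending_prefix are hypothetical names introduced here, and the code only reports which labels repeat and how many frames from the start of the list are needed before duplicates appear. It does not explain why they appear.

```python
import pandas as pd

def duplicated_labels(frame):
    """Return the column labels that occur more than once in frame."""
    cols = frame.columns
    return list(cols[cols.duplicated()].unique())

def first_offending_prefix(frames):
    """Return the smallest n such that concat(frames[:n]) has duplicate
    columns, or None if no prefix of the list produces duplicates."""
    for n in range(1, len(frames) + 1):
        if pd.concat(frames[:n]).columns.duplicated().any():
            return n
    return None

# Toy check with the frames from this answer.
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 4, 5]})

side_by_side_dups = duplicated_labels(pd.concat([df1, df2], axis=1))
stacked_dups = duplicated_labels(pd.concat([df1, df2]))
offending = first_offending_prefix([df1, df2])
```

On the real list, something like first_offending_prefix(result_data) followed by duplicated_labels(pd.concat(result_data[:n])) should point to the frame and the labels worth inspecting more closely.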
Mayank Porwal answered Sep 30 '22 19:09