pandas: pd.concat results in duplicated columns

Tags: python, pandas

I have a number of large dataframes in a list. I concatenate all of them to produce a single large dataframe.

import pandas as pd

# df_list is a list of dataframes
result = pd.concat(df_list, axis=0)
result.columns.duplicated().any()  # This returns True

My expectation was that pd.concat would not produce duplicate columns.

I want to understand when it could result in duplicate columns so that I can debug the source.

I could not reproduce the problem with a toy dataset.

I have verified that the input dataframes have unique columns by running df.columns.duplicated().any() on each of them.

The pandas version used is 1.0.1.

(Pdb) p result_data[0].columns.duplicated().any()
False
(Pdb) p result_data[1].columns.duplicated().any()
False
(Pdb) p result_data[2].columns.duplicated().any()
False
(Pdb) p result_data[3].columns.duplicated().any()
False
(Pdb) p pd.concat(result_data[0:4]).columns.duplicated().any()
True

asked Apr 30 '20 02:04

Suresh

1 Answer

Consider the behaviour below:

In [452]: df1 = pd.DataFrame({'A':[1,2,3], 'B':[2,3,4]})

In [468]: df2 = pd.DataFrame({'A':[1,2,3], 'B':[2,4,5]})

In [460]: df_list = [df1,df2]

Concatenating along axis=1 places the dataframes side by side and keeps the duplicate column labels:

In [463]: pd.concat(df_list, axis=1)
Out[463]: 
   A  B  A  B
0  1  2  1  2
1  2  3  2  4
2  3  4  3  5

pd.concat concatenates the dataframes as they are. It never drops duplicate column labels on its own.
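To see which labels are repeated after a side-by-side concat, and to keep only the first occurrence of each, something like the following sketch may help. The names wide, dup_mask, dup_labels and deduped are illustrative only, and keeping the first occurrence is an assumption about what you want.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 4, 5]})

# Side-by-side concat keeps both copies of 'A' and 'B'
wide = pd.concat([df1, df2], axis=1)

# True for the second and later occurrences of each column label
dup_mask = wide.columns.duplicated()

# The labels that appear more than once
dup_labels = wide.columns[dup_mask].unique().tolist()

# Keep only the first occurrence of each label (assumed to be the desired one)
deduped = wide.loc[:, ~dup_mask]
```

Here dup_labels lists the repeated labels, and deduped has each label exactly once.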

If you concatenate without specifying axis (the default is axis=0), the dataframes are stacked one below another and aligned on their column labels.

So in this example you can end up with duplicate rows, but not duplicate columns.

In [477]: pd.concat(df_list)
Out[477]: 
   A  B
0  1  2  ## duplicate row
1  2  3
2  3  4
0  1  2  ## duplicate row
1  2  4
2  3  5

You can remove these duplicate rows by using drop_duplicates():

In [478]: pd.concat(df_list).drop_duplicates()
Out[478]: 
   A  B
0  1  2
1  2  3
2  3  4
1  2  4
2  3  5
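The frames above share the same columns. As a rough illustration of how the default axis=0 aligns on column labels when the column sets differ, here is a small sketch with two hypothetical frames, left and right, that are not from the question:

```python
import pandas as pd

# Hypothetical frames that share only column 'B'
left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
right = pd.DataFrame({'B': [5, 6], 'C': [7, 8]})

# axis=0 is the default: rows are stacked and columns are
# aligned by label, so the result has the union of the labels
stacked = pd.concat([left, right])

# Cells with no source value are filled with NaN
missing_in_a = int(stacked['A'].isna().sum())

# Each label appears once in the result
has_dup_columns = bool(stacked.columns.duplicated().any())
```

With these inputs the shared label 'B' ends up as a single column, and the rows coming from right have NaN in 'A'.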

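Update after OP's comment: with the toy data above, the same checks as in the question all return False, so the duplication does not reproduce here:

In [507]: df_list[0].columns.duplicated().any()
Out[507]: False

In [508]: df_list[1].columns.duplicated().any()
Out[508]: False

In [510]: pd.concat(df_list[0:2]).columns.duplicated().any()
Out[510]: False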
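Since the toy data does not reproduce the problem, one way to narrow it down on the real data is to concatenate growing prefixes of the list and report the first point at which duplicated labels show up. This is only a debugging sketch: find_duplicate_source is a hypothetical helper, not a pandas function, and it does not explain why the duplication happens.

```python
import pandas as pd

def find_duplicate_source(frames):
    """Return (index, duplicated_labels) for the first prefix of `frames`
    whose concatenation has duplicated column labels, or None if no
    prefix does. `index` is the position of the last frame in that prefix."""
    for i in range(1, len(frames) + 1):
        cols = pd.concat(frames[:i]).columns
        dup = cols[cols.duplicated()]
        if len(dup) > 0:
            return i - 1, dup.tolist()
    return None

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 4, 5]})

print(find_duplicate_source([df1, df2]))  # None for this toy data
```

Once the offending frame and labels are known, comparing repr() of those labels and the dtype of each frame's columns index may show what differs. Re-concatenating every prefix is slow on large frames, so this is meant for a one-off debugging session.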

answered Sep 30 '22 19:09

Mayank Porwal
