I have a number of large dataframes in a list. I concatenate all of them to produce a single large dataframe.
df_list # This contains a list of dataframes
result = pd.concat(df_list, axis=0)
result.columns.duplicated().any() # This returns True
My expectation was that pd.concat would not produce duplicate columns.
I want to understand when it can produce duplicate columns so that I can track down the source.
I could not reproduce the problem with a toy dataset.
I have verified that the input dataframes have unique columns by running df.columns.duplicated().any() on each of them.
The pandas version used is 1.0.1.
(Pdb) p result_data[0].columns.duplicated().any()
False
(Pdb) p result_data[1].columns.duplicated().any()
False
(Pdb) p result_data[2].columns.duplicated().any()
False
(Pdb) p result_data[3].columns.duplicated().any()
False
(Pdb) p pd.concat(result_data[0:4]).columns.duplicated().any()
True
Use the merge() function to join the two dataframes with an inner join. Add a suffix such as 'remove' to the newly joined columns that have the same name in both dataframes, then use the drop() function to remove the columns carrying that suffix. This ensures that identically named columns don't exist in the new dataframe.
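The merge-then-drop approach described above can be sketched as follows. The frames `left` and `right`, the join column `key`, and the suffix `_remove` are made up for illustration; only `merge`, its `suffixes` parameter, and `drop` come from pandas itself.

```python
import pandas as pd

# Hypothetical example frames: both share the columns "key" and "A".
left = pd.DataFrame({"key": [1, 2, 3], "A": [10, 20, 30], "B": [1, 1, 1]})
right = pd.DataFrame({"key": [2, 3, 4], "A": [200, 300, 400], "C": [7, 8, 9]})

# Inner join on "key". Overlapping non-key columns from the right frame
# get the "_remove" suffix; the left frame's columns keep their names.
merged = left.merge(right, on="key", how="inner", suffixes=("", "_remove"))

# Drop every column that carries the suffix.
to_drop = [c for c in merged.columns if c.endswith("_remove")]
merged = merged.drop(columns=to_drop)

print(merged)
```

Note that this keeps the left frame's values for the overlapping column and silently discards the right frame's, so it is only appropriate when the two copies are known to be redundant.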
In this benchmark, concatenating multiple dataframes with the pandas.concat function is 50 times faster than using DataFrame.append.
The main difference between merge and concat is that merge lets you perform a more structured "join" of tables, whereas concat is broader and less structured.
Check the below behaviour:
In [452]: df1 = pd.DataFrame({'A':[1,2,3], 'B':[2,3,4]})
In [468]: df2 = pd.DataFrame({'A':[1,2,3], 'B':[2,4,5]})
In [460]: df_list = [df1,df2]
Concatenating along axis=1 keeps the duplicate columns:
In [463]: pd.concat(df_list, axis=1)
Out[474]:
   A  B  A  B
0  1  2  1  2
1  2  3  2  4
2  3  4  3  5
pd.concat always concatenates the dataframes as is. It does not drop duplicate columns at all.
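If the duplicate column labels produced by an axis=1 concat are unwanted, one common idiom is to keep only the first occurrence of each label. This is a minimal sketch using the same toy frames; the names `wide` and `deduped` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 4, 5]})

# Side-by-side concat: the column labels A and B each appear twice.
wide = pd.concat([df1, df2], axis=1)

# columns.duplicated() marks the second and later occurrence of each label,
# so negating it keeps only the first occurrence (here, df1's columns).
deduped = wide.loc[:, ~wide.columns.duplicated()]

print(deduped)
```

This selects by label position only; it does not check whether the dropped columns actually held the same values.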
If you concatenate without specifying the axis (the default is axis=0), it appends one dataframe below the other, aligned on the same columns.
So you can now have duplicate rows, but not duplicate columns.
In [477]: pd.concat(df_list)
Out[477]:
   A  B
0  1  2  ## duplicate row
1  2  3
2  3  4
0  1  2  ## duplicate row
1  2  4
2  3  5
You can remove these duplicate rows by using drop_duplicates():
In [478]: pd.concat(df_list).drop_duplicates()
Out[478]:
   A  B
0  1  2
1  2  3
2  3  4
1  2  4
2  3  5
Update after OP's comment:
In [507]: df_list[0].columns.duplicated().any()
Out[507]: False
In [508]: df_list[1].columns.duplicated().any()
Out[508]: False
In [510]: pd.concat(df_list[0:2]).columns.duplicated().any()
Out[510]: False
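Since the toy frames do not reproduce the problem, a possible next step is to narrow down which input frame first triggers it and which labels are involved. The helper below is a hypothetical sketch (`first_duplicating_prefix` is not a pandas function); it repeats the check from the pdb session for growing prefixes of the list and does not by itself explain the cause.

```python
import pandas as pd

def first_duplicating_prefix(frames):
    """Return (n, labels): the smallest n for which pd.concat(frames[:n])
    has duplicate column labels, plus those labels.
    Returns (None, []) if no prefix produces duplicates."""
    for n in range(1, len(frames) + 1):
        cols = pd.concat(frames[:n], axis=0).columns
        # keep=False marks every occurrence of a repeated label
        mask = cols.duplicated(keep=False)
        if mask.any():
            return n, list(cols[mask].unique())
    return None, []

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 4, 5]})

# With the toy frames nothing is duplicated, matching Out[510] above.
print(first_duplicating_prefix([df1, df2]))
```

On the real data, inspecting the returned labels with repr() and type() in each input frame's columns may show how they differ between frames.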