I'm trying to find overlapping rows in two pandas DataFrames with the same columns, but different number of rows:
df1.shape
(187399, 784)
df2.shape
(9790, 784)
After the pd.merge() operation
common_cols = df1.columns.tolist()
df3 = pd.merge(df1, df2, on=common_cols, how="inner")
I get a result that has more rows than either df1 or df2:
df3.shape
(283979, 784)
How is it possible and what am I doing wrong?
I have two dfs, both with 784 columns named [0,1,2,3...783], and a different number of rows in each df. I just want to find the intersection of identical rows in these dfs, meaning that if a row is present in both df1 and df2, it has to go to df3.
In a previous step I removed the duplicates from each df with DataFrame.drop_duplicates().
Link to the jupyter notebook with code after the header "Problem 5" https://github.com/kuatroka/udacity_deep_learning/blob/master/1_notmnist-Copy1.ipynb
Consider the two dataframes df1 and df2:
df1 = pd.DataFrame(dict(A=[1, 1, 1], B=[9, 8, 7]))
df2 = pd.DataFrame(dict(A=[1, 1, 1], C=[6, 5, 4]))
print(df1)
print()
print(df2)
   A  B
0  1  9
1  1  8
2  1  7

   A  C
0  1  6
1  1  5
2  1  4
If we merge on column 'A', the result contains one row for every combination of a df1 row and a df2 row whose 'A' values are equal. Here every 'A' is 1, so all 3 x 3 = 9 combinations are returned.
df1.merge(df2)
   A  B  C
0  1  9  6
1  1  9  5
2  1  9  4
3  1  8  6
4  1  8  5
5  1  8  4
6  1  7  6
7  1  7  5
8  1  7  4
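The row count of such a merge can be predicted from how often each key value occurs on each side. Here is a minimal sketch reusing the toy frames above. The helper name expected_merge_rows is made up for illustration, and it assumes an inner merge on a single key column with no missing values.

```python
import pandas as pd

df1 = pd.DataFrame(dict(A=[1, 1, 1], B=[9, 8, 7]))
df2 = pd.DataFrame(dict(A=[1, 1, 1], C=[6, 5, 4]))


def expected_merge_rows(left, right, key):
    """Hypothetical helper: row count of an inner merge on one key column."""
    left_counts = left[key].value_counts()
    right_counts = right[key].value_counts()
    # Multiplication aligns on the key value; keys present on only one side
    # become NaN and are skipped by sum().
    return int((left_counts * right_counts).sum())


merged = df1.merge(df2, on="A")
print(len(merged), expected_merge_rows(df1, df2, "A"))
```

Each key value contributes (occurrences on the left) x (occurrences on the right) rows, which is why a merge on repeated keys can end up larger than both inputs.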
Answer
You have duplicate rows in both dataframes for the same keys you are merging on.
To solve that problem, you can drop the duplicates on the merge keys before merging (though you need to decide whether this is appropriate for your data):
df1.drop_duplicates(common_cols).merge(df2.drop_duplicates(common_cols))
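For the original goal of finding rows present in both frames, a sketch along these lines may help. The two small frames, the column names px0 and px1, and the variable names naive and common are illustrative stand-ins for the 784-column data, not taken from the notebook. validate="one_to_one" is an optional pd.merge argument that raises a MergeError if the keys are still duplicated on either side.

```python
import pandas as pd

# Toy stand-ins for the real data: same columns, some repeated rows.
df1 = pd.DataFrame({"px0": [1, 1, 2, 3], "px1": [5, 5, 6, 7]})
df2 = pd.DataFrame({"px0": [1, 1, 3, 4], "px1": [5, 5, 7, 8]})
common_cols = df1.columns.tolist()

# Without de-duplication, every copy of a row on the left pairs with
# every copy of the same row on the right.
naive = pd.merge(df1, df2, on=common_cols, how="inner")

# De-duplicate on the merge keys first, and let pandas verify uniqueness.
common = pd.merge(
    df1.drop_duplicates(common_cols),
    df2.drop_duplicates(common_cols),
    on=common_cols,
    how="inner",
    validate="one_to_one",
)
print(len(naive), len(common))
```

On this toy data the naive merge returns 5 rows while the de-duplicated merge returns the 2 distinct shared rows. If the validate check fails on real data, that is a sign duplicates are still present in at least one input.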
I want to post the solution to my own problem. It turned out to be purely technical, not functional, so what @piRSquared said was totally correct.
It turned out to be a very strange problem. My conda installation had the Intel MKL module installed, and it was enabled by default. This module supposedly improves the speed of numpy, scipy and scikit-learn. Once I disabled it with the CLI command conda install nomkl, I got correct results from my very first code. I'm adding new tags for MKL in case someone else gets this strange pd.merge() behaviour.
Thanks to everyone.