
pandas merge(how="inner") result is bigger than both dataframes

I'm trying to find overlapping rows in two pandas DataFrames that have the same columns but different numbers of rows:

df1.shape
(187399, 784)

df2.shape
(9790, 784)

After this pd.merge() operation:

common_cols = df1.columns.tolist()
df3 = pd.merge(df1, df2, on=common_cols, how="inner")

I get a result that is bigger than both df1 and df2:

df3.shape
(283979, 784)

How is this possible, and what am I doing wrong? I have two DataFrames, both with 784 columns named [0, 1, 2, 3, ..., 783], and a different number of rows in each. I just want to find the intersection of identical rows: if a row is present in both df1 and df2, it should go to df3. In a previous step I removed the duplicates from each DataFrame with DataFrame.drop_duplicates().

The code is in this Jupyter notebook, under the header "Problem 5": https://github.com/kuatroka/udacity_deep_learning/blob/master/1_notmnist-Copy1.ipynb

kuatroka asked Apr 13 '17 14:04


2 Answers

Consider the two DataFrames df1 and df2:

import pandas as pd

# Both frames repeat the same key value in column 'A' three times
df1 = pd.DataFrame(dict(A=[1, 1, 1], B=[9, 8, 7]))
df2 = pd.DataFrame(dict(A=[1, 1, 1], C=[6, 5, 4]))


print(df1)
print()
print(df2)

   A  B
0  1  9
1  1  8
2  1  7

   A  C
0  1  6
1  1  5
2  1  4

If we merge on column 'A', the result contains one row for every pair of rows whose 'A' values are equal. Here each of the three rows in df1 matches each of the three rows in df2, so we get 3 x 3 = 9 rows.

df1.merge(df2)

   A  B  C
0  1  9  6
1  1  9  5
2  1  9  4
3  1  8  6
4  1  8  5
5  1  8  4
6  1  7  6
7  1  7  5
8  1  7  4
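The size of the result can be predicted from how often each key occurs on each side. The following is a minimal sketch, assuming the same toy df1 and df2 as above; the variable names left_counts, right_counts and expected_rows are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(dict(A=[1, 1, 1], B=[9, 8, 7]))
df2 = pd.DataFrame(dict(A=[1, 1, 1], C=[6, 5, 4]))

# Count how many times each key value occurs on each side.
left_counts = df1.groupby("A").size()
right_counts = df2.groupby("A").size()

# For an inner merge, every key present on both sides contributes
# (occurrences on the left) * (occurrences on the right) rows.
# Keys missing on one side become NaN after alignment and are dropped.
expected_rows = int(left_counts.mul(right_counts).dropna().sum())

merged = df1.merge(df2, on="A", how="inner")
print(expected_rows, len(merged))
```

If the predicted count is larger than either input, at least one key is repeated on both sides.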

Answer
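Both DataFrames contain duplicate rows for the keys you are merging on, so each duplicated key on one side is matched against every copy of it on the other side.

To solve that problem, you can drop the duplicates on the merge keys before merging (though you need to decide whether this is appropriate for your data):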

df1.drop_duplicates(common_cols).merge(df2.drop_duplicates(common_cols))
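As a rough sketch of how this could be checked, the validate argument of merge can be used to make pandas raise an error when the keys are not unique, instead of silently multiplying rows. The data and the column names "a" and "b" below are hypothetical, chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical small frames; each contains one repeated row.
df1 = pd.DataFrame({"a": [1, 1, 2, 3], "b": [10, 10, 20, 30]})
df2 = pd.DataFrame({"a": [1, 3, 3, 4], "b": [10, 30, 30, 40]})
common_cols = df1.columns.tolist()

# Without de-duplication, a one-to-one check fails loudly.
raised = False
try:
    df1.merge(df2, on=common_cols, how="inner", validate="one_to_one")
except pd.errors.MergeError:
    raised = True

# After dropping duplicates on the merge keys, the same check passes.
left = df1.drop_duplicates(common_cols)
right = df2.drop_duplicates(common_cols)
df3 = left.merge(right, on=common_cols, how="inner", validate="one_to_one")

print(raised)
print(df3)
```

With unique keys on both sides, the inner merge can never have more rows than the smaller input.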
piRSquared answered Sep 18 '22 13:09


I want to post the solution to my own problem. The cause was purely technical, not functional, so what @piRSquared wrote is still entirely correct.

It turned out to be a very strange problem. My conda installation had the Intel MKL module installed, and it was enabled by default. This module is supposed to speed up numpy, scipy and scikit-learn. Once I disabled it with the CLI command conda install nomkl, I got correct results from my very first code. I'm adding new tags for MKL in case someone else runs into this strange pd.merge() behaviour. Thanks to everyone.
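For anyone who suspects their environment in the same way, one possible sanity check is to compare the merge result with a plain Python set intersection of row tuples, which does not go through the merge code path at all. This is only a sketch on hypothetical data (the column names "a" and "b" are made up), and it assumes both frames are already free of duplicate rows.

```python
import pandas as pd

# Hypothetical duplicate-free frames sharing the same columns.
df1 = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
df2 = pd.DataFrame({"a": [2, 3, 4], "b": [20, 30, 40]})
common_cols = df1.columns.tolist()

merged = pd.merge(df1, df2, on=common_cols, how="inner")

# Independent cross-check: turn each row into a plain tuple and
# intersect the two sets without using merge.
rows1 = set(df1.itertuples(index=False, name=None))
rows2 = set(df2.itertuples(index=False, name=None))
shared = rows1 & rows2

merged_rows = set(merged.itertuples(index=False, name=None))
print(len(merged), len(shared), merged_rows == shared)
```

If the two counts disagree on duplicate-free input, the problem is more likely in the environment than in the merge call itself.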

kuatroka answered Sep 20 '22 13:09