pandas: merged (inner join) data frame has more rows than the original ones

I am using Python 3.4 in a Jupyter Notebook and trying to merge two data frames as shown below:

df_A.shape
(204479, 2)

df_B.shape
(178, 3)

new_df = pd.merge(df_A, df_B, how='inner', on='my_icon_number')
new_df.shape
(266788, 4)

I thought new_df should have fewer rows than df_A, since the merge is an inner join. Why does new_df actually have more rows than df_A?

Here is what I actually want:

My df_A looks like:

 id           my_icon_number
-----------------------------
 A1             123             
 B1             234
 C1             123
 D1             235
 E1             235
 F1             400

and my df_B is like:

my_icon_number    color      size
-------------------------------------
  123              blue      small
  234              red       large 
  235              yellow    medium

Then I want new_df to be:

 id           my_icon_number     color       size
--------------------------------------------------
 A1             123              blue        small
 B1             234              red         large
 C1             123              blue        small
 D1             235              yellow      medium
 E1             235              yellow      medium

I don't really want to remove duplicates of my_icon_number in df_A. Any idea what I missed here?

Edamame asked Jan 10 '17 23:01



1 Answer

Because you have duplicates of the merge column in both data sets, you'll get k * m rows with that merge column value, where k is the number of rows with that value in data set 1 and m is the number of rows with that value in data set 2.
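
To see where the extra rows come from, one can count how often each key occurs on each side and multiply the counts. The sketch below uses small made-up frames (`left`, `right`, and their contents are illustrative assumptions, not the asker's data) and compares the predicted size with the actual merge.

```python
import pandas as pd

# Hypothetical toy data: key 1 appears 2x on the left and 1x on the right,
# key 4 appears 3x on each side, keys 2 and 7 appear on one side only.
left = pd.DataFrame({"my_icon_number": [1, 1, 2, 4, 4, 4], "a": range(6)})
right = pd.DataFrame({"my_icon_number": [1, 4, 4, 4, 7], "b": range(5)})

# Number of occurrences of each key value on each side.
k = left["my_icon_number"].value_counts()
m = right["my_icon_number"].value_counts()

# Keys present on both sides contribute k * m rows; keys missing from
# either side become NaN in the product and drop out of an inner join.
rows_per_key = (k * m).dropna().astype(int)
predicted_rows = int(rows_per_key.sum())

merged = pd.merge(left, right, how="inner", on="my_icon_number")

# Keys repeated in the right frame are the ones that inflate the result.
repeated_in_right = m[m > 1].index.tolist()

print(predicted_rows, len(merged), repeated_in_right)
```

Running the same `value_counts()` check on df_B's `my_icon_number` column should show which key values are repeated there.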

Try drop_duplicates to remove the repeated keys before merging:

dfa = df_A.drop_duplicates(subset=['my_icon_number'])
dfb = df_B.drop_duplicates(subset=['my_icon_number'])

new_df = pd.merge(dfa, dfb, how='inner', on='my_icon_number')
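
Since the question asks to keep the repeated my_icon_number values in df_A, a possible variant is to drop duplicates only from df_B, the lookup side. The sketch below rebuilds the sample frames from the question. The `validate="many_to_one"` argument (available in pandas 0.21 and later) is an added safeguard that is not part of the original answer; it raises a `MergeError` if the right-hand frame still has repeated keys.

```python
import pandas as pd

# Sample frames reconstructed from the tables in the question.
df_A = pd.DataFrame({
    "id": ["A1", "B1", "C1", "D1", "E1", "F1"],
    "my_icon_number": [123, 234, 123, 235, 235, 400],
})
df_B = pd.DataFrame({
    "my_icon_number": [123, 234, 235],
    "color": ["blue", "red", "yellow"],
    "size": ["small", "large", "medium"],
})

# Deduplicate only the lookup side; the first row per key is kept.
dfb = df_B.drop_duplicates(subset=["my_icon_number"])

# Every df_A row with a matching key survives, duplicates included;
# F1 (key 400) has no match and is dropped by the inner join.
new_df = pd.merge(df_A, dfb, how="inner", on="my_icon_number",
                  validate="many_to_one")

print(new_df)
```

Note that `drop_duplicates` keeps the first row for each key, so if df_B holds conflicting color or size values for the same my_icon_number, that choice should be made deliberately.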

Example

In this example, the only value in common is 4 but I have it 3 times in each data set. That means I should get 9 total rows in the resulting merge, one for every combination.

import pandas as pd

df_A = pd.DataFrame(dict(my_icon_number=[1, 2, 3, 4, 4, 4], other_column1=range(6)))
df_B = pd.DataFrame(dict(my_icon_number=[4, 4, 4, 5, 6, 7], other_column2=range(6)))

pd.merge(df_A, df_B, how='inner', on='my_icon_number')

   my_icon_number  other_column1  other_column2
0               4              3              0
1               4              3              1
2               4              3              2
3               4              4              0
4               4              4              1
5               4              4              2
6               4              5              0
7               4              5              1
8               4              5              2

piRSquared answered Oct 27 '22 08:10