I am using Python 3.4 in a Jupyter Notebook, trying to merge two data frames like below:
df_A.shape
(204479, 2)
df_B.shape
(178, 3)
new_df = pd.merge(df_A, df_B, how='inner', on='my_icon_number')
new_df.shape
(266788, 4)
I thought the merged new_df should have fewer rows than df_A, since merge is like an inner join. So why does new_df here actually have more rows than df_A?
Here is what I actually want:
My df_A is like:
id my_icon_number
-----------------------------
A1 123
B1 234
C1 123
D1 235
E1 235
F1 400
and my df_B is like:
my_icon_number color size
-------------------------------------
123 blue small
234 red large
235 yellow medium
Then I want new_df to be:
id my_icon_number color size
--------------------------------------------------
A1 123 blue small
B1 234 red large
C1 123 blue small
D1 235 yellow medium
E1 235 yellow medium
I don't really want to remove duplicates of my_icon_number in df_A. Any idea what I missed here?
Because you have duplicates of the merge column in both data sets, you'll get k * m rows for each value of the merge column, where k is the number of rows with that value in data set 1 and m is the number of rows with that value in data set 2.
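A quick way to confirm that this is what's happening is to count how often each key occurs on each side before merging. A minimal sketch, assuming your frames are named df_A and df_B as above:
counts_a = df_A['my_icon_number'].value_counts()
counts_b = df_B['my_icon_number'].value_counts()
print(counts_a[counts_a > 1])  # keys repeated in df_A
print(counts_b[counts_b > 1])  # keys repeated in df_B
Any my_icon_number that appears more than once in df_B will fan out every matching row of df_A.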
Try drop_duplicates. Since you said you want to keep the duplicate my_icon_number values in df_A, it is enough to deduplicate df_B before merging:
dfb = df_B.drop_duplicates(subset=['my_icon_number'])
new_df = pd.merge(df_A, dfb, how='inner', on='my_icon_number')
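Newer pandas versions (0.21 and up) also let merge itself verify the relationship, so a duplicated key on the right raises an error instead of silently multiplying rows; a small sketch under that assumption:
# 'many_to_one' checks that my_icon_number is unique in the right frame
# and raises pandas.errors.MergeError otherwise
new_df = pd.merge(df_A, dfb, how='inner', on='my_icon_number',
                  validate='many_to_one')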
In this example, the only value in common is 4, but I have it 3 times in each data set. That means I should get 3 * 3 = 9 total rows in the resulting merge, one for every combination.
df_A = pd.DataFrame(dict(my_icon_number=[1, 2, 3, 4, 4, 4], other_column1=range(6)))
df_B = pd.DataFrame(dict(my_icon_number=[4, 4, 4, 5, 6, 7], other_column2=range(6)))
pd.merge(df_A, df_B, how='inner', on='my_icon_number')
my_icon_number other_column1 other_column2
0 4 3 0
1 4 3 1
2 4 3 2
3 4 4 0
4 4 4 1
5 4 4 2
6 4 5 0
7 4 5 1
8 4 5 2
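A quick way to sanity-check this is to compute the expected inner-merge size from the key counts on each side; a sketch using the same toy frames, which is also how the 266,788 rows in the question come about:
import pandas as pd

df_A = pd.DataFrame(dict(my_icon_number=[1, 2, 3, 4, 4, 4], other_column1=range(6)))
df_B = pd.DataFrame(dict(my_icon_number=[4, 4, 4, 5, 6, 7], other_column2=range(6)))

# expected rows = sum over shared keys of (count in df_A) * (count in df_B)
counts_a = df_A['my_icon_number'].value_counts()
counts_b = df_B['my_icon_number'].value_counts()
expected = int(counts_a.mul(counts_b, fill_value=0).sum())
print(expected)                                          # 9
print(len(pd.merge(df_A, df_B, on='my_icon_number')))    # 9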