Here is what the dataframes' columns look like.
df1='device number', 'date', ....<<10 other columns>> 3500 records
df2='device number', 'date', ....<<9 other columns>> 14,000 records
In each data frame, neither 'device number' nor 'date' is unique on its own. However, their combination uniquely identifies a row.
I am trying to form a new data frame that matches the rows from df1 and df2 where both device number and date are equal, and that has all the columns from df1 and df2. The pandas command I am trying is
df3=pd.merge(df1, df2, how='inner', on=['device number', 'date'])
However, df3 gives me a dataframe of shape (14,000, 21). The number of columns makes sense, but how can the inner join have more rows than the left dataframe? Does it mean I have a flaw in my understanding of inner joins? Also, how can I achieve the result I described?
As you can see, merge is faster than join. The difference per call is small, but over 4000 iterations it adds up to minutes.
Pandas Join vs Merge Differences
The main difference between join and merge is that join() combines two DataFrames on the index by default, whereas merge() is primarily used to specify the columns you want to join on. merge() also supports joining on indexes and on a combination of index and columns.
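To make the distinction concrete, here is a minimal sketch on two made-up frames. The names left, right, key, x and y are illustrative assumptions and do not come from the question.

```python
import pandas as pd

# Hypothetical toy frames sharing a "key" column.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# merge() matches rows on the named column(s).
merged = left.merge(right, on="key", how="inner")

# join() matches on the index, so the key is moved into the index first.
joined = left.set_index("key").join(right.set_index("key"), how="inner")
```

Both calls keep the same two keys; the visible difference is that merged carries key as an ordinary column, while joined carries it as the index.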
INNER Merge
Pandas uses an "inner" merge by default. This keeps only the rows whose key values appear in both the left and right dataframes. In our case, only the rows with use_id values common to user_usage and user_device remain in the merged data, inner_merge.
The main difference between merge and concat is that merge lets you perform a structured, key-based "join" of tables, whereas concat is broader and less structured: it stacks frames along an axis without matching keys.
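A small sketch of that difference, using made-up frames a and b (the names and values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical frames that share only the "id" column.
a = pd.DataFrame({"id": [1, 2], "v": [10, 20]})
b = pd.DataFrame({"id": [2, 3], "w": [200, 300]})

# merge() pairs rows whose "id" values match.
matched = pd.merge(a, b, on="id", how="inner")

# concat() stacks the frames row-wise; columns missing from one frame become NaN.
stacked = pd.concat([a, b], ignore_index=True)
```

Here matched contains only the row for the shared id, while stacked contains every row from both frames with gaps where a column did not exist in the source frame.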
The only way I can see this happening, particularly with 14,000 being exactly the number of records in df2, is if the column combination in df2 is not unique.
You can check whether the combination is unique with the following (True if unique)
df2.duplicated(['device number', 'date']).sum() == 0
Or
df2.set_index(['device number', 'date']).index.is_unique
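To see how duplicate key pairs inflate an inner join, here is a minimal sketch on toy data. The values and the extra columns a and b are made up for illustration; only the key column names come from the question. It also uses the validate argument of pd.merge, which is one possible way to make pandas raise an error rather than silently multiply rows.

```python
import pandas as pd

keys = ["device number", "date"]

# Hypothetical toy data: df1 has unique key pairs, df2 repeats one pair.
df1 = pd.DataFrame({
    "device number": [1, 2],
    "date": ["2020-01-01", "2020-01-01"],
    "a": [10, 20],
})
df2 = pd.DataFrame({
    "device number": [1, 1, 2],
    "date": ["2020-01-01", "2020-01-01", "2020-01-01"],
    "b": [100, 101, 200],
})

# keep=False flags every row that shares its key pair with another row.
dup_rows = df2[df2.duplicated(keys, keep=False)]

# Each df1 row pairs with every matching df2 row, so the result can
# have more rows than df1.
df3 = pd.merge(df1, df2, how="inner", on=keys)

# validate="one_to_one" raises MergeError if either side has duplicate keys.
try:
    pd.merge(df1, df2, how="inner", on=keys, validate="one_to_one")
    has_duplicate_keys = False
except pd.errors.MergeError:
    has_duplicate_keys = True
```

Inspecting dup_rows shows which records share a key pair, which helps decide whether to drop them, aggregate them, or add another column to the join key.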