
Inner join/merge in pandas DataFrame gives more rows than the left DataFrame

Here is what the DataFrames' columns look like.

df1: 'device number', 'date', ... <<10 other columns>> (3,500 records)

df2: 'device number', 'date', ... <<9 other columns>> (14,000 records)

In each DataFrame, neither 'device number' nor 'date' is unique on its own. However, their combination uniquely identifies a row.

I am trying to build a new DataFrame that matches the rows from df1 and df2 where both device number and date are equal, and that has all the columns from both df1 and df2. The pandas command I am using is

df3=pd.merge(df1, df2, how='inner', on=['device number', 'date'])

However, df3 has shape (14,000, 21). The number of columns makes sense, but how can the inner join have more rows than the left DataFrame? Does it mean I have a flaw in my understanding of inner joins? Also, how can I achieve the result I described?

Della asked Jul 23 '17 06:07




1 Answer

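The only way I can see this happening, particularly with 14,000 being exactly the number of records in df2, is if the column combinations in df2 are not unique. An inner join returns one row for every matching pair of rows, so a key that appears once in df1 and several times in df2 produces several rows in the result.

You can check whether they are unique with the following (True if unique)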
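As a rough illustration of that point, here is a minimal sketch on made-up data. The column names follow the question, but the values and the extra columns left_value and right_value are hypothetical. One key combination is repeated in df2, and the inner join ends up with more rows than df1.

```python
import pandas as pd

# Hypothetical toy data: column names follow the question, values are invented.
df1 = pd.DataFrame({
    "device number": [1, 1, 2],
    "date": ["2017-07-01", "2017-07-02", "2017-07-01"],
    "left_value": [10, 20, 30],
})

# The key (1, "2017-07-01") appears twice on the right-hand side.
df2 = pd.DataFrame({
    "device number": [1, 1, 1, 2],
    "date": ["2017-07-01", "2017-07-01", "2017-07-02", "2017-07-01"],
    "right_value": ["a", "b", "c", "d"],
})

# Each left row is paired with every right row that has the same key,
# so the duplicated key contributes two rows to the result.
df3 = pd.merge(df1, df2, how="inner", on=["device number", "date"])

print(len(df1), len(df2), len(df3))  # prints: 3 4 4
```

Here every key in df1 has at least one match in df2, so the result has as many rows as df2, which mirrors the 14,000 seen in the question.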

df2.duplicated(['device number', 'date']).sum() == 0

Or

df2.set_index(['device number', 'date']).index.is_unique
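If the check comes back False, a possible next step is to look at the offending rows and decide how to resolve them before merging. The sketch below uses invented data and assumes, purely for illustration, that keeping the first row of each duplicated key is acceptable; whether that is right depends on what the duplicates mean in the real data. The validate argument of pd.merge makes pandas raise an error instead of silently multiplying rows.

```python
import pandas as pd

keys = ["device number", "date"]

# Hypothetical toy data: values and the extra columns are invented.
df1 = pd.DataFrame({
    "device number": [1, 1, 2],
    "date": ["2017-07-01", "2017-07-02", "2017-07-01"],
    "left_value": [10, 20, 30],
})
df2 = pd.DataFrame({
    "device number": [1, 1, 1, 2],
    "date": ["2017-07-01", "2017-07-01", "2017-07-02", "2017-07-01"],
    "right_value": ["a", "b", "c", "d"],
})

# keep=False flags every row that belongs to a duplicated key group,
# which makes the conflicting rows easy to inspect side by side.
dupes = df2[df2.duplicated(keys, keep=False)]
print(dupes)

# Assumption: keeping the first occurrence per key is acceptable here.
df2_unique = df2.drop_duplicates(subset=keys, keep="first")

# validate="one_to_one" asks pandas to check that the keys are unique
# on both sides and to raise if they are not.
df3 = pd.merge(df1, df2_unique, how="inner", on=keys, validate="one_to_one")
print(df3.shape)
```

With the duplicates resolved, df3 can have at most as many rows as df1.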
piRSquared answered Sep 27 '22 23:09