Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Number of rows changes even after `pandas.merge` with `left` option

Tags:

python

pandas

I am merging two data frames using pandas.merge. Even after specifying how = left option, I found the number of rows of merged data frame is larger than the original. Why does this happen?

panel = pd.read_csv(file1, encoding ='cp932')
before_len = len(panel)

prof_2000 = pd.read_csv(file2, encoding ='cp932').drop_duplicates()

temp_2000 = pd.merge(panel, prof_2000, left_on='Candidate_u', right_on="name2", how="left")

after_len =  len(temp_2000)
print(before_len, after_len)
> 12661 13915
like image 818
user51966 Avatar asked May 08 '16 00:05

user51966


2 Answers

This sounds like having more than one rows in right under 'name2' that match the key you have set for the left. Using option 'how='left' with pandas.DataFrame.merge() only means that:

  • left: use only keys from left frame

However, the actual number of rows in the result object is not necessarily going to be the same as the number of rows in the left object.

Example:

In [359]: df_1
Out[359]: 
   A    B
0  a  AAA
1  b  BBA
2  c  CCF

and then another DF that looks like this (notice that there are more than one entry for your desired key on the left):

In [360]: df_3
Out[360]: 
  key  value
0   a      1
1   a      2
2   b      3
3   a      4

If I merge these two on left.A, here's what happens:

In [361]: df_1.merge(df_3, how='left', left_on='A', right_on='key')
Out[361]: 
   A    B  key  value
0  a  AAA    a    1.0
1  a  AAA    a    2.0
2  a  AAA    a    4.0
3  b  BBA    b    3.0
4  c  CCF  NaN    NaN

This happened even though I merged with how='left' as you can see above, there were simply more than one rows to merge and as shown here the result pd.DataFrame has in fact more rows than the pd.DataFrame on the left.

I hope this helps!

like image 52
Thanos Avatar answered Oct 14 '22 01:10

Thanos


The problem of doubling of rows after each merge() (of any type, 'both' or 'left') is usually caused by duplicates in any of the keys, so we need to drop them first:

left_df.drop_duplicates(subset=left_key, inplace=True)
right_df.drop_duplicates(subset=right_key, inplace=True)
like image 27
mirekphd Avatar answered Oct 14 '22 00:10

mirekphd