Is this a bug or do I not understand something?

I wanted to merge two datasets on their key value and got strange results. I made a simple version to reproduce that problem.

import pandas as pd

df    = pd.DataFrame({'key':[1, 2, 3]})
other = pd.DataFrame({'key':[1, 2, 3]})

df.join(
    other,
    on='key',
    lsuffix='_caller'
)

I got this output:

    key_caller  key
0   1           2.0
1   2           3.0
2   3           NaN

I thought this was strange, so I decided to try this one:

df    = pd.DataFrame({'key':[i for i in range(3)]})
other = pd.DataFrame({'key':[i for i in range(3)]})

df.join(
    other,
    on='key',
    lsuffix='_caller'
)

And got the result I expected:

    key_caller  key
0   0           0
1   1           1
2   2           2

If the keys do not include zero, the join is messed up, but if they start at zero everything works fine.

So can someone explain what's going on?

Narek Maloyan asked Apr 17 '20 21:04



1 Answer

The values of the two examples are different. In the first, they are 1, 2, and 3. In the second example, they are 0, 1, 2.

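join matches the column named in on from the left DataFrame against the index of the right DataFrame, not against its key column. Both DataFrames have the default index 0, 1, 2. In the second example, because you used range, the index of the right DataFrame is identical to the values of key in the left DataFrame, so the match is perfect. In the first example, the left keys 1 and 2 match index labels 1 and 2, whose key values are 2 and 3, which is why the output looks shifted by one row. There is no index label 3, so that row gets NaN, and because a plain int64 column cannot hold NaN, the values are converted to float.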
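As a rough illustration of the lookup described above, here is a minimal sketch using the DataFrames from the question. The reindex call is only an analogy for "look the left key values up in the right index" and is not meant to show what join does internally. The merge call is one possible way to match the key columns against each other instead; the names lookup and merged are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'key': [1, 2, 3]})
other = pd.DataFrame({'key': [1, 2, 3]})

# other has the default index 0, 1, 2. Looking up the left key values
# 1, 2, 3 in that index finds labels 1 and 2 but not 3.
# (reindex is used here only as an analogy for the index lookup.)
lookup = other['key'].reindex(df['key'].tolist())
print(lookup.tolist())         # [2.0, 3.0, nan]

# merge compares the 'key' column of both frames, so every row matches.
merged = df.merge(other, on='key')
print(merged['key'].tolist())  # [1, 2, 3]
```

If the goal is to match on column values, merge with on='key' is probably the simpler choice; join is more natural when the right DataFrame is already indexed by the key.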

Eric Truett answered Oct 12 '22 14:10
