Is this a bug or do I not understand something?

Tags: python, pandas, dataframe

I wanted to merge two datasets on their key value and got strange results. I made a simple version that reproduces the problem.

import pandas as pd

df = pd.DataFrame({'key': [1, 2, 3]})
other = pd.DataFrame({'key': [1, 2, 3]})

df.join(
    other,
    on='key',
    lsuffix='_caller'
)

I got this output:

    key_caller  key
0   1           2.0
1   2           3.0
2   3           NaN

I thought this was strange, so I decided to try this one:

df = pd.DataFrame({'key': list(range(3))})
other = pd.DataFrame({'key': list(range(3))})

df.join(
    other,
    on='key',
    lsuffix='_caller'
)

And got the result I expected:

    key_caller  key
0   0           0
1   1           1
2   2           2

If the keys do not include zero, the join is messed up, but if they start at zero, everything works fine.

So can someone explain what's going on?


asked Apr 17 '20 21:04

Narek Maloyan

1 Answer

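The two examples use different values: 1, 2, 3 in the first and 0, 1, 2 in the second.

With on='key', join matches the values of the key column in the left dataframe against the index of the right dataframe, not against the right dataframe's key column. Both dataframes have the default index 0, 1, 2. In the second example, because you used range, the key values are also 0, 1, 2, so each left key finds the row with the same index label and the match is perfect. In the first example, key 1 finds the row at index 1 (whose key is 2), key 2 finds the row at index 2 (whose key is 3), and key 3 has no matching index label, so you get NaN. Because an integer column cannot hold NaN, the values are converted to float. This is the documented behavior of join, not a bug.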
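A minimal sketch of what is happening and two possible ways around it. The frame other_with_value and its 'value' column are hypothetical, added only so the result has something to show; the reindex call is an illustration of the lookup, not the code join runs internally.

```python
import pandas as pd

df = pd.DataFrame({'key': [1, 2, 3]})
other = pd.DataFrame({'key': [1, 2, 3]})

# Illustration of the lookup: each left 'key' value is treated as a label
# in other's index (0, 1, 2), not as a value in other's 'key' column.
looked_up = other['key'].reindex(df['key'].tolist())
print(looked_up.tolist())  # label 3 does not exist, so the last entry is NaN

# Hypothetical right-hand frame with an extra column to carry over.
other_with_value = pd.DataFrame({'key': [1, 2, 3], 'value': ['a', 'b', 'c']})

# Option 1: merge compares the 'key' columns of both frames.
merged = df.merge(other_with_value, on='key')

# Option 2: make 'key' the index of the right frame, so join has the
# intended labels to match against.
joined = df.join(other_with_value.set_index('key'), on='key')

print(merged)
print(joined)
```

Both options should pair each key with its own row. Which one fits better depends on whether the right-hand frame is naturally indexed by the key.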

answered Oct 12 '22 14:10

Eric Truett
