Compare columns of Pandas dataframe for equality to produce True/False, even NaNs

Tags:

python

pandas

dataframe

I have two columns in a pandas dataframe that are supposed to be identical. Each column has many NaN values. I would like to compare the columns, producing a 3rd column containing True / False values; True when the columns match, False when they do not.

This is what I have tried:

df['new_column'] = (df['column_one'] == df['column_two'])

The above works for the numbers, but not the NaN values.

I know I could replace the NaNs with a sentinel value that cannot occur in my data (for my data this could be -9999) and then remove it later when I'm ready to output the comparison results. However, I was wondering if there is a more Pythonic method I'm overlooking.

293

asked Sep 15 '16 02:09

traggatmot

1 Answer

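Or you could use the equals method, which treats NaNs in the same position as equal:

columns_match = df['column_one'].equals(df['column_two'])

Note that equals returns a single True/False for the two columns as a whole, not one value per row, so assigning its result to df['new_column'] would only broadcast that one value to every row. It is a batteries-included check: it handles NaNs for you, and it returns True only when the two columns also have the same dtype. For a row-by-row result, compare with == and additionally count rows where both values are NaN as a match.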
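For a row-by-row result, one possible sketch is to combine the ordinary == comparison with a check that both values are NaN. The sample data below is made up for illustration, and the intermediate names same_value and both_nan are hypothetical. equals is shown alongside for contrast, since it yields one boolean for the whole column rather than one per row.

```python
import numpy as np
import pandas as pd

# Made-up sample data: NaNs appear in matching and non-matching positions.
df = pd.DataFrame({
    "column_one": [1.0, 2.0, np.nan, 4.0, np.nan],
    "column_two": [1.0, 3.0, np.nan, np.nan, 5.0],
})

# Element-wise equality; NaN == NaN evaluates to False here.
same_value = df["column_one"] == df["column_two"]

# Rows where both columns are NaN should count as a match.
both_nan = df["column_one"].isna() & df["column_two"].isna()

# One True/False per row.
df["new_column"] = same_value | both_nan

# Whole-column check: a single boolean, True only if every row matches.
columns_identical = df["column_one"].equals(df["column_two"])

print(df)
print(columns_identical)
```

With this data, only the first and third rows are flagged as matching, and the whole-column check is False because at least one row differs.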

122

answered Nov 06 '22 05:11

Kartik
