I just ran into some weird behaviour comparing the values of two pandas DataFrames using pd.DataFrame.equals():
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = df1.copy()
df1.equals(df2)
# True (obviously)
However, when I change the column type to a different integer format, they will not be considered equal anymore:
df1['a'] = df1['a'].astype(np.int32)
df1.equals(df2)
# False
In the .equals() documentation, they point out that the variables must have the same type, and present an example comparing floats to integers, which doesn't work. I didn't expect this to extend to different types of integers, too.
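To see which columns trip the dtype check, the per-column dtypes of the two frames can be compared directly. This is only a small sketch: the explicit `dtype='int64'` and the name `mismatched` are assumptions added for illustration, so the result does not depend on platform defaults.

```python
import numpy as np
import pandas as pd

# Build both frames with an explicit 64-bit integer dtype (an assumption
# made here so the example behaves the same on every platform).
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, dtype='int64')
df2 = df1.copy()

# Narrow one column to 32-bit integers, as in the question.
df1['a'] = df1['a'].astype(np.int32)

# List the columns whose dtypes differ between the two frames.
mismatched = [col for col in df1.columns if df1[col].dtype != df2[col].dtype]
print(mismatched)  # ['a']
```

Only column 'a' differs in dtype, which is enough for .equals() to return False even though every value matches.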
When doing the same comparison using ==, it does return True:
(df1 == df2).all().all()
# True
However, == doesn't assess two missing values as equal to each other.
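A minimal illustration of that difference, using two hypothetical frames `left` and `right` that each hold a NaN in the same position:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'a': [1.0, np.nan]})
right = left.copy()

# Element-wise comparison: NaN == NaN evaluates to False.
elementwise_all_equal = bool((left == right).all().all())

# .equals() treats NaNs in the same location as equal.
equals_result = left.equals(right)

print(elementwise_all_equal)  # False
print(equals_result)          # True
```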
Is there an elegant way to handle missing values as equal, whilst not enforcing the same integer type? The best I can come up with is:
(df1.fillna(0) == df2.fillna(0)).all().all()
but there has to be a more concise and less hacky way to deal with this problem.
My follow-up, opinion-based question: Would you consider this a bug?
If you think of this as a decimal problem (i.e. does 2 equal 2) then this perhaps looks like a bug. However, if you look at it from how the values are stored (i.e. does 00000010 equal 0000000000000010, shown here with 8 and 16 bits for brevity) then it becomes plain that there is indeed a difference: the two representations have different widths at the bit level.
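The same point can be checked with NumPy scalars. This is just a sketch with the actual widths from the question (32 and 64 bits); the names `narrow` and `wide` are made up for the example.

```python
import numpy as np

narrow = np.int32(2)
wide = np.int64(2)

# Numerically the two values compare equal...
print(narrow == wide)  # True

# ...but they occupy a different number of bytes,
print(narrow.itemsize, wide.itemsize)  # 4 8

# so their raw byte representations are not identical.
print(narrow.tobytes() == wide.tobytes())  # False
```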
From a validation perspective, it is probably a good idea to make sure you are comparing apples to apples, and so I like the answer of @Ben.T:
df1.equals(df2.astype(df1.dtypes))
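Here is a rough, self-contained sketch of that cast-then-compare approach on frames that also contain a missing value. The sample data is invented for illustration. As an alternative that is not part of the answer above, pd.testing.assert_frame_equal with check_dtype=False should also ignore the integer width while treating NaNs in the same position as equal; it raises an AssertionError instead of returning False.

```python
import numpy as np
import pandas as pd

# Same values, different integer widths, and a NaN in the same position.
df1 = pd.DataFrame({'a': np.array([1, 2, 3], dtype=np.int32),
                    'b': [4.0, np.nan, 6.0]})
df2 = pd.DataFrame({'a': np.array([1, 2, 3], dtype=np.int64),
                    'b': [4.0, np.nan, 6.0]})

# Strict comparison fails because of the int32 / int64 mismatch.
strict = df1.equals(df2)

# Casting df2 to df1's dtypes first makes the comparison value-based.
after_cast = df1.equals(df2.astype(df1.dtypes))

# Alternative: the testing helper, wrapped so it yields a boolean.
try:
    pd.testing.assert_frame_equal(df1, df2, check_dtype=False)
    same_values = True
except AssertionError:
    same_values = False

print(strict, after_cast, same_values)
```

Note that the cast can fail or lose information if the dtypes are not compatible (for example, casting a float column containing NaN to an integer dtype), so it fits best when the columns are expected to differ only in width.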
Is this a bug? That is above my pay grade. You can submit it, and the thinkers surrounding the pandas library can make a decision. It does seem odd that the '==' operator gives different results than the '.equals' function, and that may sway the decision.