How to check if two data frames are equal [duplicate]

Tags: database, dataframe, r, compare, dataset

Say I have large datasets in R and I just want to know whether two of them are the same. I use this often when I'm experimenting with different algorithms to achieve the same result. For example, say we have the following datasets:

    df1 <- data.frame(num = 1:5, let = letters[1:5])
    df2 <- df1
    df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])
    df4 <- df3

So this is what I do to compare them:

    table(x == y, useNA = 'ifany')

Which works great when the datasets have no NAs:

    > table(df1 == df2, useNA = 'ifany')

    TRUE
      10

But not so much when they have NAs:

    > table(df3 == df4, useNA = 'ifany')

    TRUE <NA>
      11    1

In the example, it's easy to dismiss the NA as not a problem since we know that both data frames are equal. The problem is that NA == <anything> yields NA, so whenever one of the datasets has an NA, it doesn't matter what the other one has in that same position: the result is always going to be NA.
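To make the NA behaviour concrete, here is a minimal sketch. The data frames a and b are made up for illustration and are not part of the example above. They really do differ in the second row, yet the comparison reports NA there rather than FALSE.

```r
# Element-wise comparison with NA always yields NA
na_vs_value <- NA == 1      # NA
na_vs_na    <- NA == NA     # NA

# Hypothetical data frames that differ in row 2
a <- data.frame(num = c(1, NA))
b <- data.frame(num = c(1, 99))

cmp <- a == b                # logical matrix: TRUE in row 1, NA in row 2
table(cmp, useNA = "ifany")  # the mismatch is hidden inside the <NA> count
```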

So using table() to compare datasets doesn't seem ideal to me. How can I better check if two data frames are identical?

P.S.: Note this is not a duplicate of R - comparing several datasets, Comparing 2 datasets in R or Compare datasets in R

795

asked Oct 01 '13 14:10

Waldir Leoncio

2 Answers

Look up all.equal(). It has some caveats (for instance, it compares numbers with a small tolerance rather than exactly), but it might work for you.

    all.equal(df3, df4)
    # [1] TRUE
    all.equal(df2, df1)
    # [1] TRUE
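Two of the things to watch for with all.equal(), sketched with made-up data frames x, y and z (these names are assumptions for illustration, not from the question): numbers are compared with a small tolerance, and on a mismatch the function returns a character vector instead of FALSE, so it is usually wrapped in isTRUE() when used in a condition.

```r
x <- data.frame(v = c(1, 2, 3))
y <- data.frame(v = c(1, 2, 3 + 1e-10))  # differs by a tiny amount
z <- data.frame(v = c(1, 2, 4))          # differs clearly

near  <- all.equal(x, y)   # TRUE: the difference is below the default tolerance
exact <- identical(x, y)   # FALSE: the values are not exactly equal

# all.equal(x, z) returns a description of the difference, not FALSE,
# so wrap it in isTRUE() to get a plain logical
same <- isTRUE(all.equal(x, z))   # FALSE
```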

146

answered Sep 19 '22 23:09

TheComeOnMan

As Metrics pointed out, one could also use identical() to compare the datasets. The difference between this approach and that of Codoremifa is that identical() will just yield TRUE or FALSE, depending on whether the objects being compared are identical or not, whereas all.equal() will either return TRUE or give hints about the differences between the objects. For instance, consider the following:

    > identical(df1, df3)
    [1] FALSE
    > all.equal(df1, df3)
    [1] "Attributes: < Component 2: Numeric: lengths (5, 6) differ >"
    [2] "Component 1: Numeric: lengths (5, 6) differ"
    [3] "Component 2: Lengths: 5, 6"
    [4] "Component 2: Attributes: < Component 2: Lengths (5, 6) differ (string compare on first 5) >"
    [5] "Component 2: Lengths (5, 6) differ (string compare on first 5)"
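A small sketch of how identical() behaves on the NA case from the question, plus one strictness difference worth knowing about. The data frames ints and dbls are hypothetical and only there to show that identical() distinguishes integer from double columns while all.equal() does not.

```r
df3 <- data.frame(num = c(1:5, NA), let = letters[1:6])
df4 <- df3

# identical() treats NA in matching positions as equal
na_ok <- identical(df3, df4)              # TRUE

# identical() is strict about storage type; all.equal() is more lenient
ints <- data.frame(num = 1:5)               # integer column
dbls <- data.frame(num = c(1, 2, 3, 4, 5))  # double column
strict <- identical(ints, dbls)             # FALSE
loose  <- isTRUE(all.equal(ints, dbls))     # TRUE
```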

Moreover, from what I've tested, identical() seems to run much faster than all.equal().
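If you want to check the speed difference on your own machine, something like the sketch below should do. The size (1e6 rows) and the names big and copy are arbitrary assumptions, and timings will vary from system to system, so no output is shown here.

```r
set.seed(1)
big  <- data.frame(x = runif(1e6))
# Build the second data frame from a fresh vector so the two objects
# do not share memory
copy <- data.frame(x = big$x + 0)

t_identical <- system.time(res_identical <- identical(big, copy))
t_all_equal <- system.time(res_all_equal <- all.equal(big, copy))

# Compare elapsed times (in seconds) for the two approaches
print(t_identical["elapsed"])
print(t_all_equal["elapsed"])
```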

answered Sep 16 '22 23:09

Waldir Leoncio
