Identical data frames with different digests in R?

Tags: dataframe, r, hash

I have two large data frames, a and b, for which identical(a,b) is TRUE, as is all.equal(a,b), but identical(digest(a),digest(b)) is FALSE. What could cause this?

What's more, I tried to dig deeper by applying digest() to subsets of rows. Incredibly, at least to me, the digest values agree on sub-frames all the way to the last row of the data frames.

Here is a sequence of comparisons:

> identical(a, b)
[1] TRUE
> all.equal(a, b)
[1] TRUE
> digest(a)
[1] "cac56b06078733b6fb520442e5482684"
> digest(b)
[1] "fdd5ab78ca961982d195f800e3cf60af"
> digest(a[1:nrow(a),])
[1] "e44f906723405756509a6b17b5949d1a"
> digest(b[1:nrow(b),])
[1] "e44f906723405756509a6b17b5949d1a"

Every method I can think of indicates these two objects are identical, but their digest values are different. Is there something else about data frames that can produce such discrepancies?

For further details: the objects are about 10M rows x 12 columns. Here's the output of str():

'data.frame':   10056987 obs. of  12 variables:
 $ V1 : num  1 11 21 31 41 61 71 81 91 101 ...
 $ V2 : num  1 1 1 1 1 1 1 1 1 1 ...
 $ V3 : num  2 3 2 3 4 5 2 4 2 4 ...
 $ V4 : num  1 1 1 1 1 1 1 1 1 1 ...
 $ V5 : num  1.8 2.29 1.94 2.81 3.06 ...
 $ V6 : num  0.0653 0.0476 0.0324 0.034 0.0257 ...
 $ V7 : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
 $ V8 : num  0.00653 0.00476 0.00324 0.0034 0.00257 ...
 $ V9 : num  1.8 2.3 1.94 2.81 3.06 ...
 $ V10: num  0.1957 0.7021 0.0604 0.1866 0.9371 ...
 $ V11: num  1704 1554 1409 1059 1003 ...
 $ V12: num  23309 23309 23309 23309 23309 ...

> print(object.size(a), units = "Mb")
920.7 Mb

Update 1: On a whim, I converted these to matrices. The digests are the same.

> aM = as.matrix(a)
> bM= as.matrix(b)
> identical(aM,bM)
[1] TRUE
> digest(aM)
[1] "c5147d459ba385ca8f30dcd43760fc90"
> digest(bM)
[1] "c5147d459ba385ca8f30dcd43760fc90"

I then tried converting back to a data frame, and the digest values are equal (and equal to the previous value for a).

> aMF = as.data.frame(aM)
> bMF = as.data.frame(bM)
> digest(aMF)
[1] "cac56b06078733b6fb520442e5482684"
> digest(bMF)
[1] "cac56b06078733b6fb520442e5482684"

So, b looks like the bad boy, and it has a colorful past. b came from a much bigger data frame, say B. I took only the columns of B that appeared in a and checked to see if they were equal. Well, they were equal, but had different digests. I converted the column names (from "InformativeColumnName1" to "V1", etc.), just to avoid any issues that might arise - though all.equal and identical tend to point out when column names differ.

Since I am working on two different programs and don't have simultaneous access to a and b, it is easiest for me to use the digest values to check the calculations. However, something seems to be odd in how I extract columns from a data frame and then apply digest() to it.

ANSWER: It turns out, to my astonishment (dismay, horror, embarrassment, you name it), that identical() is very forgiving about attributes: by default it ignores the order in which they are stored. I had assumed that only all.equal() was forgiving about attributes.

This was discovered via Tommy's suggestion identical(d1, d2, attrib.as.set=FALSE). Running attributes(a) is a bad, bad idea: the deluge of row names took a while before Ctrl-C could interrupt it. Here is the output of names(attributes()) for each object:

> names(attributes(a))
[1] "names"     "row.names" "class"    
> names(attributes(b))
[1] "names"     "class"     "row.names"

They're in different orders! Kudos to digest() for being straight with me.
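
For anyone who wants to see the effect without a 10M-row object, here is a minimal sketch in base R. The toy frames x and y are made up for illustration, and reversing the attribute list is only an assumed way of producing a different order; it is not necessarily how b got its ordering.

```r
# Toy data frames (hypothetical; not the original a and b)
x <- data.frame(V1 = c(1, 11, 21), V2 = c(1, 1, 1))
y <- x

# Rebuild y's attributes in reverse order; the attribute values are unchanged
attributes(y) <- rev(attributes(x))

# Inspect only the attribute names, which avoids printing every row name
names(attributes(x))
names(attributes(y))

# Default comparison treats the attributes as an unordered set
same_default <- identical(x, y)
# Strict comparison also requires the attributes to be in the same order
same_ordered <- identical(x, y, attrib.as.set = FALSE)
# Cheap check that scales to large frames: compare the name order directly
same_name_order <- identical(names(attributes(x)), names(attributes(y)))
```

Comparing names(attributes()) is the inexpensive check here, since it never materializes or prints the row names themselves.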

UPDATE

To aid others with this problem: simply rearranging the attributes into a common order appears to be enough to get identical hash values. Since tinkering with attribute order is new to me, this may break something, but it works in my case. Note that it is a little time-consuming if the objects are big; I'm not aware of a faster method for doing this. (I'm also looking to move to using matrices or data tables instead of data frames, and this may be another incentive to avoid data frames.)

tmpA0   = attributes(a)
tmpA1   = tmpA0[sort(names(tmpA0))]
a2      = a
attributes(a2) = tmpA1

tmpB0   = attributes(b)
tmpB1   = tmpB0[sort(names(tmpB0))]
b2      = b
attributes(b2) = tmpB1

digest(a2)  # e04e624692d82353479efbd713ec03f6
digest(b2)  # e04e624692d82353479efbd713ec03f6

identical(b,b2, attrib.as.set = FALSE) # FALSE
identical(b,b2, attrib.as.set = TRUE) # TRUE
identical(a2,b2, attrib.as.set = FALSE) # TRUE
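
The same steps can be wrapped in a small function. This is only a sketch: normalize_attr_order is a hypothetical helper name, the toy frames stand in for a and b, and method = "radix" is my assumption for getting a sort order that does not depend on the session's locale (it needs a reasonably recent R). Because digest() hashes the serialized object, the check below compares serialize() output directly and uses base R only.

```r
# Hypothetical helper: rebuild the attribute list in sorted name order.
# method = "radix" sorts in the C locale (assumes R >= 3.3.0).
normalize_attr_order <- function(x) {
  at <- attributes(x)
  if (is.null(at)) return(x)          # nothing to reorder
  attributes(x) <- at[sort(names(at), method = "radix")]
  x
}

# Toy frames with the same content but opposite attribute order
x <- data.frame(V1 = c(1, 11, 21), V2 = c(1, 1, 1))
y <- x
attributes(y) <- rev(attributes(x))

# Normalize both objects, so they go through exactly the same code path
x2 <- normalize_attr_order(x)
y2 <- normalize_attr_order(y)

# Compare the serialized bytes before and after normalization
bytes_before <- identical(serialize(x, NULL), serialize(y, NULL))
bytes_after  <- identical(serialize(x2, NULL), serialize(y2, NULL))
```

Both objects should be passed through the helper before hashing, even if one of them already looks sorted, so that any side effects of reassigning attributes apply equally to each.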


asked Sep 28 '11 15:09

Iterator

2 Answers

Without having the actual data.frames it is of course hard to know, but one difference could be the order of the attributes. identical ignores that by default, but setting attrib.as.set=FALSE can change that:

d1 <- structure(1, foo=1, bar=2)
d2 <- structure(1, bar=2, foo=1)

identical(d1, d2) # TRUE
identical(d1, d2, attrib.as.set=FALSE) # FALSE


answered Sep 25 '22 07:09

Tommy

Our digest package uses the internal R function serialize() to get what we feed to the hash-generating functions (md5, sha1, ...).

So I strongly suspect that the two objects differ in something like an attribute. Until you can construct something reproducible that does not depend on your 1e7 x 12 data set, there is little we can do.

Also, the digest() function can output intermediate results and (as of the recent 0.5.1 version) even raw vectors. That may help. Lastly, you can always contact us (as the package maintainers / authors) off-line, which happens to be the recommended way within R land, the popularity of StackOverflow notwithstanding.
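
As a small reproducible illustration that does not need the large data set, the serialized bytes can be compared directly in base R. This is a sketch under the assumption that the byte stream from serialize(x, NULL) is close to what gets hashed; d1 and d2 are toy values with the same attributes stored in a different order.

```r
# Two values with the same attributes stored in a different order
d1 <- structure(1, foo = 1, bar = 2)
d2 <- structure(1, bar = 2, foo = 1)

# serialize(x, NULL) returns the byte stream as a raw vector
s1 <- serialize(d1, NULL)
s2 <- serialize(d2, NULL)

same_value <- identical(d1, d2)   # attribute order is ignored by default
same_bytes <- identical(s1, s2)   # the byte streams record the order

# Index of the first differing byte (both streams have the same length here)
first_diff <- which(s1 != s2)[1]
```

If same_value is TRUE while same_bytes is FALSE, any hash computed over the serialized form will differ as well, which matches the behaviour described in the question.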

answered Sep 23 '22 07:09

Dirk Eddelbuettel
