I have two data frames that were generated on two different occasions, but I suspect they are equal. Both have the same number of rows and columns, and visually they seem to be the same, except for how the rows are ordered.
Neither has an ID column by which I could reorder; the best I can do is reorder both by a process_number variable, which is the closest I can get to a unique column. However, even after that reordering, identical yields FALSE and all.equal gives me this (summarized):
[1] "Component 2: 32 string mismatches"
[16] "Component 18: 'is.NA' value mismatch: 183357 in current 183357 in target"
[23] "Component 27: Mean relative difference: 0.4688722"
[24] "Component 28: Mean relative difference: 0.0004968944"
[26] "Component 30: Attributes: < Component 2: 365 string mismatches >"
[28] "Component 31: 'current' is not a factor"
The best option I've found for these cases is to use the "compare" package:
library(compare)
compare(df1, df2, allowAll = TRUE)
The allowAll argument tries different transformations (for example, reordering rows, reordering columns, changing column types from factors to characters, and so on) and then gives you a summary of whether after different transformations, the two inputs are the same or not. If they are the same after transformations have been applied, it tells you which transformations were required to make them the same.
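If you would rather not depend on a package, the same kind of transformations can be approximated in base R. This is only a rough sketch of the idea, not how compare implements it; normalise_df is a hypothetical helper name and the toy data frames are made up for illustration.

```r
# Hypothetical helper: normalise a data frame so that row order, column order
# and factor-vs-character differences no longer affect the comparison.
normalise_df <- function(df) {
  # Convert factor columns to character
  df[] <- lapply(df, function(col) if (is.factor(col)) as.character(col) else col)
  # Put the columns in a fixed (alphabetical) order
  df <- df[, sort(names(df)), drop = FALSE]
  # Sort rows using every column as a key, so ties are broken consistently
  df <- df[do.call(order, unname(as.list(df))), , drop = FALSE]
  # Discard the original row names, which still reflect the old order
  rownames(df) <- NULL
  df
}

# Toy data: same content, different row order, column order and column type
df1 <- data.frame(id = c(2, 1, 3), grp = factor(c("b", "a", "c")))
df2 <- data.frame(grp = c("c", "a", "b"), id = c(3, 1, 2), stringsAsFactors = FALSE)

isTRUE(all.equal(normalise_df(df1), normalise_df(df2)))
```

If this still reports differences on your real data, the data frames most likely differ in content and not just in ordering or column types.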
Your method is correct.
all.equal is telling you that your data frames are not reorderings of each other.
For more details, try examining
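# Row indices where column 2 differs; which() drops comparisons that are NA
mismatch_in_col_2 <- which(data1[, 2] != data2[, 2])
# Show the differing values side by side
cbind(data1[mismatch_in_col_2, 2], data2[mismatch_in_col_2, 2])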
(Repeat for the other columns with differences.)
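To avoid repeating this by hand, a small loop can count the mismatches in every column at once. The sketch below is one possible way to do it: count_mismatches is a hypothetical helper, the toy data frames stand in for your real ones, and it assumes both data frames have the same columns in the same order. Values are compared as character strings, so tiny floating-point differences may go unnoticed.

```r
# Hypothetical helper: number of mismatching cells per column.
# A cell counts as a mismatch if exactly one side is NA, or if both
# sides are present and their character representations differ.
count_mismatches <- function(a, b) {
  counts <- vapply(seq_along(a), function(j) {
    x <- as.character(a[[j]])
    y <- as.character(b[[j]])
    sum(xor(is.na(x), is.na(y)) | (!is.na(x) & !is.na(y) & x != y))
  }, integer(1))
  setNames(counts, names(a))
}

# Toy data standing in for the real data frames
data1 <- data.frame(a = c(1, 2, NA), b = c("x", "y", "z"))
data2 <- data.frame(a = c(1, 3, NA), b = c("x", "y", NA))

count_mismatches(data1, data2)
```

Columns with a count of zero can be ignored; the rest are the ones worth inspecting row by row.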
You mentioned that process_number "is the closest I can get to a unique column". Perhaps some of the difference relates to ties being ordered in a different way. Is there a second column you can sort on?
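As a quick illustration of the tie problem, here is a minimal sketch on made-up data. The amount column is a hypothetical second sort key; substitute whichever column of yours best distinguishes rows that share a process_number.

```r
# Toy data: process_number contains ties
df <- data.frame(process_number = c(2, 1, 2, 1), amount = c(10, 7, 5, 9))

# How many rows repeat an earlier process_number?
n_ties <- sum(duplicated(df$process_number))

# Sorting on process_number alone keeps tied rows in their incoming order,
# so two differently ordered inputs can remain differently ordered
by_one <- df[order(df$process_number), ]

# Adding a second key makes the order of tied rows deterministic
by_two <- df[order(df$process_number, df$amount), ]
```

If n_ties is greater than zero for your data, sorting on process_number alone cannot guarantee that the two data frames end up in the same row order.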