
Test whether a dataframe is a sorted version of another dataframe

Is it feasible to test whether some dataframe is simply a sorted version of another dataframe? For example, if I have two dataframes a and b, is there some way to easily determine whether a is simply a reordered version of b (or vice versa)?

Here's a trivial example:

a <- data.frame(x1=1:10, x2=11:20, x3=1:2)
b <- a[order(a$x3, a$x1, decreasing=TRUE),]

The closest thing I can think of is all.equal, but its output is not helpful (to me, at least):

> all.equal(a,b)
[1] "Attributes: < Component 2: Mean relative difference: 0.9545455 >"
[2] "Component 1: Mean relative difference: 0.9545455"                
[3] "Component 2: Mean relative difference: 0.3387097"                
[4] "Component 3: Mean relative difference: 0.6666667"

I imagine there is some obvious way to do this that is eluding me. I'm looking for a general solution that would scale well to many variables and many observations (the example above is only for demonstration).

Also: Ideally, such a function would also identify whether a is a subset of b (or vice versa).

Thomas asked Dec 07 '13 16:12





2 Answers

I would explore the "compare" package:

library(compare)
compare(a, b, allowAll=TRUE)
# TRUE
#   sorted

The output shows that compare had to sort the rows before it found the two data frames to be the same.

Here's a slightly more complicated example, with factors coerced to character, rows reordered, and columns reordered:

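# stringsAsFactors = TRUE is needed in R >= 4.0.0, where the default is FALSE
a <- data.frame(x1=1:10, x2=11:20, x3=1:2, x4 = letters[1:10], stringsAsFactors = TRUE)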
b <- with(a, a[order(x3, x1, decreasing=TRUE), ])
b$x4 <- as.character(b$x4)
b <- b[c(4, 1, 3, 2)]

Here's the result of compare:

compare(a, b, allowAll=TRUE)
# TRUE
#   reordered columns
#   [x4] coerced from <character> to <factor>
#   sorted
A5C1D2H2I1M1N2O1R2T1 answered Oct 04 '22 16:10



You can sort both data frames by all of their columns and compare the results with identical:

identical(a[do.call(order, a), ], b[do.call(order, b), ])
#[1] TRUE
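
The one-liner above returns TRUE here because both data frames still carry the row names of a. If b had its row names reset, or a column happened to be named like an argument of order (for example "decreasing"), the result could differ. Here is a minimal base R sketch that guards against both, plus a rough check for the subset case raised in the question. The names is_reordered and is_row_subset are hypothetical helpers, not functions from any package, and the subset check assumes that comparing rows as pasted strings is good enough for your data:

```r
# Hypothetical helper: TRUE if `a` and `b` hold the same rows in any order.
is_reordered <- function(a, b) {
  # Different shapes or column names can never be a pure reordering.
  if (!identical(dim(a), dim(b)) || !identical(names(a), names(b))) {
    return(FALSE)
  }
  canon <- function(d) {
    # Sort by every column; unname() keeps column names such as
    # "decreasing" from being read as arguments to order().
    d <- d[do.call(order, unname(as.list(d))), , drop = FALSE]
    rownames(d) <- NULL  # discard row names, which record the old order
    d
  }
  identical(canon(a), canon(b))
}

# Hypothetical helper: TRUE if every row of `a` also occurs in `b`.
# Assumption: rows are compared as pasted strings, so duplicate rows are
# not counted and values that print identically are treated as equal.
is_row_subset <- function(a, b) {
  if (!identical(names(a), names(b))) {
    return(FALSE)
  }
  key <- function(d) do.call(paste, c(unname(as.list(d)), sep = "\r"))
  all(key(a) %in% key(b))
}

a <- data.frame(x1 = 1:10, x2 = 11:20, x3 = 1:2)
b <- a[order(a$x3, a$x1, decreasing = TRUE), ]

is_reordered(a, b)
is_row_subset(b[1:4, ], a)
```

Because is_reordered relies on identical, it is strict about column types (an integer column and a double column with the same values will not match). If you need that kind of tolerance, the compare package from the other answer is probably the better fit.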
Sven Hohenstein answered Oct 04 '22 15:10
