I've begun to believe that data frames hold no advantages over matrices, except for notational convenience. However, I noticed this oddity when running unique() on matrices and data frames: it seems to run faster on a data frame.
    a = matrix(sample(2, 10^6, replace = TRUE), ncol = 10)
    b = as.data.frame(a)

    system.time({ u1 = unique(a) })
       user  system elapsed
      1.840   0.000   1.846

    system.time({ u2 = unique(b) })
       user  system elapsed
      0.380   0.000   0.379
The timing results diverge even more substantially as the number of rows is increased. So, there are two parts to this question.
Why is this slower for a matrix? It seems faster to convert to a data frame, run unique(), and then convert back.
Is there any reason not to just wrap unique() in a myUnique() that does the conversions in part #1?
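For concreteness, a minimal sketch of such a wrapper might look like this (assuming a plain atomic matrix with no dimnames worth preserving):

    # Hypothetical wrapper: convert to a data frame, deduplicate, convert back.
    myUnique <- function(m) {
      res <- as.matrix(unique(as.data.frame(m)))
      dimnames(res) <- NULL  # drop the V1.. column names / row names added by as.data.frame()
      res
    }

    identical(myUnique(a), unique(a))  # should be TRUE for the matrix a built above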
Note 1. Given that a matrix is atomic, it seems that unique() should be faster for a matrix, rather than slower. Being able to iterate over fixed-size, contiguous blocks of memory should generally be faster than running over separate blocks of linked lists (I assume that's how data frames are implemented...).
Note 2. As demonstrated by the performance of data.table, running unique() on a data frame or a matrix is a comparatively bad idea - see the answer by Matthew Dowle and the comments for relative timings. I've migrated a lot of objects to data tables, and this performance is another reason to do so. So although users would be well served by adopting data tables, for pedagogical / community reasons I'll leave the question open for now regarding why this takes longer on matrix objects. The answers below address where the time goes and how else we can get better performance (i.e. data tables). The answer to why is close at hand - the code can be found via unique.data.frame and unique.matrix. :) An English explanation of what it's doing and why is all that is lacking.
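For reference, the source of both methods can be printed at the R console:

    getS3method("unique", "matrix")
    getS3method("unique", "data.frame")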
Data frames can do a lot of work that matrices cannot do directly, such as fitting statistical model formulas and general data processing (not possible with a matrix; converting to a data frame first is mandatory). Transposing, i.e. changing rows to columns and vice versa, is also possible, which is useful in data science.
In a data frame the columns can contain different types of data, but in a matrix all the elements are the same type. A matrix in R is like a mathematical matrix, containing only one type of thing (usually numbers). R often, but not always, lets the two be used interchangeably.
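A small illustration of that difference (the object names below are only for the example):

    # A data frame keeps each column's type; a matrix coerces everything to one type.
    d <- data.frame(id = 1:3, name = c("a", "b", "c"), flag = c(TRUE, FALSE, TRUE))
    sapply(d, class)   # "integer", "character", "logical"
    m <- as.matrix(d)
    class(m[1, 1])     # "character" -- every element was coerced to character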
The unique() function in R is used to eliminate duplicate values or rows from a vector, data frame, or matrix. It is important in EDA (exploratory data analysis) because it directly identifies and removes the duplicate values in the data.
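For example:

    unique(c(1, 2, 2, 3, 3, 3))                       # 1 2 3
    df <- data.frame(x = c(1, 1, 2), y = c("a", "a", "b"))
    unique(df)                                        # keeps rows 1 and 3 only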
Part of the answer is already contained in your question: you use data frames if the columns (variables) can be expected to be of different types (numeric/character/logical etc.); matrices are for data of a single type. Consequently, the choice between matrix and data.frame is only an issue when your data are all of the same type.
In this implementation, unique.matrix is the same as unique.array:

    > identical(unique.array, unique.matrix)
    [1] TRUE
unique.array has to handle multi-dimensional arrays, which requires additional processing to 'collapse' the extra dimensions (those extra calls to paste()) that are not needed in the 2-dimensional case. The key section of code is:

    collapse <- (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)
    temp <- if (collapse) apply(x, MARGIN, function(x) paste(x, collapse = "\r"))
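To see what that collapse step produces, here is a small illustration (the keys variable is mine, not part of the base code, but the paste(..., collapse = "\r") call is the one shown above):

    # A tiny matrix with a duplicated row.
    x <- rbind(c(1, 2), c(3, 4), c(1, 2))

    # Each row is collapsed into a single string key, joined on "\r".
    keys <- apply(x, 1, function(r) paste(r, collapse = "\r"))
    keys
    # [1] "1\r2" "3\r4" "1\r2"

    # Duplicate detection is then done on these strings.
    x[!duplicated(keys), , drop = FALSE]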
unique.data.frame is optimised for the 2D case; unique.matrix is not. It could be, as you suggest - it just isn't in the current implementation.
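A rough sketch of where that optimisation lies, as I understand the implementations at the time of writing: the data frame method builds its row keys with one vectorized paste() across the columns, while the matrix method's apply() call above invokes paste() once per row.

    m <- matrix(sample(2, 10^6, replace = TRUE), ncol = 10)
    d <- as.data.frame(m)

    # matrix-style keys: one paste() call per row (100,000 calls here)
    system.time(k1 <- apply(m, 1, function(r) paste(r, collapse = "\r")))

    # data-frame-style keys: a single vectorized paste() across the 10 columns
    system.time(k2 <- do.call(paste, c(d, sep = "\r")))

    identical(k1, k2)  # same keys, built very differently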
Note that in all cases (unique.array, unique.matrix, unique.data.table) where there is more than one dimension, it is the string representation that is compared for uniqueness. For floating point numbers this means 15 decimal digits, so

    NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 2), nrow = 2)))

is 1, while

    NROW(unique(a <- matrix(rep(c(1, 1+5e-15), 2), nrow = 2)))

and

    NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 1), nrow = 2)))

are both 2. Are you sure unique() is what you want?
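To see the 15-digit effect directly (this is just as.character()'s default conversion, which is what feeds the pasted keys):

    as.character(1 + 4e-15)   # "1"                -- same key as 1
    as.character(1 + 5e-15)   # "1.00000000000001" -- a different key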
Not sure, but I guess that because a matrix is one contiguous vector, R copies it into column vectors first (like a data.frame) because paste() needs a list of vectors. Note that both are slow because both use paste().
Perhaps because unique.data.table is already many times faster. Please upgrade to v1.6.7 by downloading it from the R-Forge repository, because that has the fix to unique() you raised in this question. data.table doesn't use paste() to do unique().
    a = matrix(sample(2, 10^6, replace = TRUE), ncol = 10)
    b = as.data.frame(a)

    system.time(u1 <- unique(a))
       user  system elapsed
       2.98    0.00    2.99

    system.time(u2 <- unique(b))
       user  system elapsed
       0.99    0.00    0.99

    c = as.data.table(b)
    system.time(u3 <- unique(c))
       user  system elapsed
       0.03    0.02    0.05
    # 60 times faster than u1, 20 times faster than u2

    identical(as.data.table(u2), u3)
    [1] TRUE