Why is running "unique" faster on a data frame than a matrix in R?

Tags: performance, dataframe, r, data.table, matrix

I've begun to believe that data frames hold no advantages over matrices, except for notational convenience. However, I noticed this oddity when running unique on matrices and data frames: it seems to run faster on a data frame.

a   = matrix(sample(2,10^6,replace = TRUE), ncol = 10)
b   = as.data.frame(a)

system.time({
    u1 = unique(a)
})
   user  system elapsed
  1.840   0.000   1.846

system.time({
    u2 = unique(b)
})
   user  system elapsed
  0.380   0.000   0.379

The timing results diverge even more substantially as the number of rows is increased. So, there are two parts to this question.

1. Why is this slower for a matrix? It seems faster to convert to a data frame, run unique, and then convert back.
2. Is there any reason not to just wrap unique in myUnique, which does the conversions in part #1?
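
For part #2, here is a minimal sketch of what such a wrapper might look like. The name myUnique comes from the question; the body is only an illustration under the assumption that the goal is to drop duplicate rows. Instead of converting back, it asks duplicated() which rows of a data frame copy are repeats and then subsets the original matrix, so the storage type and any dimnames are kept.

```r
# Hypothetical wrapper: find duplicate rows via the data frame method,
# then subset the original matrix (no conversion back is needed).
myUnique <- function(m) {
  stopifnot(is.matrix(m))
  keep <- !duplicated(as.data.frame(m))  # one logical value per row
  m[keep, , drop = FALSE]                # keep first occurrence of each row
}

set.seed(42)
a <- matrix(sample(2, 10^4, replace = TRUE), ncol = 10)
u <- myUnique(a)
identical(u, unique(a))  # compare against the matrix method
```

One caveat: for doubles, the matrix and data frame methods may not treat nearly equal values the same way, and this can depend on the R version, so the wrapper is safest on integer, logical or character data.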

Note 1. Given that a matrix is atomic, it seems that unique should be faster for a matrix, rather than slower. Being able to iterate over fixed-size, contiguous blocks of memory should generally be faster than running over separate blocks of linked lists (I assume that's how data frames are implemented...).

Note 2. As demonstrated by the performance of data.table, running unique on a data frame or a matrix is a comparatively bad idea - see the answer by Matthew Dowle and the comments for relative timings. I've migrated a lot of objects to data tables, and this performance is another reason to do so. So although users would be well served by adopting data tables, for pedagogical / community reasons I'll leave the question open for now regarding why this takes longer on matrix objects. The answers below address where the time goes and how else we can get better performance (i.e. data tables). The answer to why is close at hand - the code can be found via unique.data.frame and unique.matrix. :) An English explanation of what it's doing and why is all that is lacking.


asked Oct 18 '11 15:10

Iterator

2 Answers

In this implementation, unique.matrix is the same as unique.array:

> identical(unique.array, unique.matrix)

[1] TRUE

unique.array has to handle multi-dimensional arrays, which requires additional processing to ‘collapse’ the extra dimensions (those extra calls to paste()) that is not needed in the 2-dimensional case. The key section of code is:

collapse <- (ndim > 1L) && (prod(dx[-MARGIN]) > 1L)

temp <- if (collapse) apply(x, MARGIN, function(x) paste(x, collapse = "\r"))

unique.data.frame is optimised for the 2D case; unique.matrix is not. It could be, as you suggest, but it just isn't in the current implementation.
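
To make the difference concrete, here is a small sketch of the two ways of building one string key per row. The row-wise version mirrors the apply/paste line quoted above. The column-wise version is an assumption about how a 2D-optimised method could work (a single vectorised paste over the columns), not a quote from the R sources. The variable names are made up for illustration.

```r
x <- matrix(c(1L, 2L, 1L,
              2L, 1L, 2L), ncol = 2)   # rows: (1,2), (2,1), (1,2)

# Row-wise: paste() is called once per row, as in the quoted code.
keys_row <- apply(x, 1, function(r) paste(r, collapse = "\r"))

# Column-wise (assumed data-frame-style approach): one vectorised
# paste() call that combines whole columns element by element.
keys_col <- do.call(paste, c(as.data.frame(x), sep = "\r"))

# Either set of keys marks the same rows as duplicates.
res <- x[!duplicated(keys_row), , drop = FALSE]
same_keys <- identical(unname(keys_row), unname(keys_col))
```

With 10^5 rows the row-wise version makes 10^5 separate paste() calls, while the column-wise version makes one, which is a plausible source of the timing gap.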

Note that in all cases (unique.{array,matrix,data.frame}) where there is more than one dimension, it is the string representation that is compared for uniqueness. For floating-point numbers this means 15 significant digits, so

NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 2), nrow = 2)))

is 1 while

NROW(unique(a <- matrix(rep(c(1, 1+5e-15), 2), nrow = 2)))

and

NROW(unique(a <- matrix(rep(c(1, 1+4e-15), 1), nrow = 2)))

are both 2. Are you sure unique is what you want?
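
The effect of the 15-digit string comparison can be sketched directly. Here sprintf("%.15g", ...) is used as an explicit stand-in for the number-to-string conversion; the conversion R applies internally may differ between versions, so treat this as an illustration of the idea rather than the exact mechanism.

```r
x_small <- 1 + 4e-15
x_large <- 1 + 5e-15

# Numerically, both values differ from 1.
differs_small <- x_small != 1
differs_large <- x_large != 1

# Formatted to 15 significant digits, the smaller offset is lost,
# so a string-based comparison sees it as equal to 1.
key_one   <- sprintf("%.15g", 1)
key_small <- sprintf("%.15g", x_small)
key_large <- sprintf("%.15g", x_large)

key_small == key_one   # the string keys collide
key_large == key_one   # the string keys stay distinct
```

If near-equal doubles should be merged or kept apart deliberately, rounding the data explicitly before calling unique makes that choice visible instead of leaving it to the string conversion.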


answered Oct 11 '22 13:10

Allan Engelhardt

1. Not sure, but I guess that because a matrix is one contiguous vector, R copies it into column vectors first (like a data.frame) because paste needs a list of vectors. Note that both are slow because both use paste.
2. Perhaps because unique.data.table is already many times faster. Please upgrade to v1.6.7 by downloading it from the R-Forge repository, because that has the fix to unique you raised in this question. data.table doesn't use paste to do unique.

library(data.table)  # needed for as.data.table
a = matrix(sample(2,10^6,replace = TRUE), ncol = 10)
b = as.data.frame(a)
system.time(u1<-unique(a))
   user  system elapsed
   2.98    0.00    2.99
system.time(u2<-unique(b))
   user  system elapsed
   0.99    0.00    0.99
c = as.data.table(b)
system.time(u3<-unique(c))
   user  system elapsed
   0.03    0.02    0.05
# 60 times faster than u1, 20 times faster than u2
identical(as.data.table(u2),u3)
[1] TRUE

answered Oct 11 '22 14:10

Matt Dowle
