Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding unique combinations irrespective of position [duplicate]

I'm sure it's something simple, but I have a data frame

      df <- data.frame(a = c(1, 2, 3),
                       b = c(2, 3, 1),
                       c = c(3, 1, 4))

And I want a new data frame that contains the unique combinations of values in the rows, irrespective of which column they're in. So in the case above I'd want

    a b c
    1 2 3
    3 1 4

I've tried

    unique(df[c('a', 'b', 'c')])

but it sees (1, 2, 3) as unique from (2, 3, 1), which I don't want.

like image 418
Thomas White Avatar asked Apr 23 '15 09:04

Thomas White


2 Answers

Maybe something like that

indx <- !duplicated(t(apply(df, 1, sort))) # finds non - duplicates in sorted rows
df[indx, ] # selects only the non - duplicates according to that index
#   a b c
# 1 1 2 3
# 3 3 1 4
like image 195
David Arenburg Avatar answered Sep 21 '22 14:09

David Arenburg


If your data.frame is quite big, the speed may be a matter for you. You can find duplicated sets much faster with the following idea.

Let's imaginary assign each possible value in rows a prime number and count products for each row. For example, for given df we can accept primenums = c(2,3,5,7) and count products c(30,30,70). Then duplicates in this products-vector correspond to duplicated sets in our data.frame. As multiplication is being computed much faster then any kinds of sorting, you can gain efficiency. The code is following.

require("numbers")
primenums <- Primes(100)[1:4]
dfmult <- apply(as.matrix(df), 1, function(z) prod(primenums[z]) )
my_indx <- !duplicated(dfmult)
df[my_indx,]

Here we initialize vector primenums with the help of function Primes from package numbers, but you can do manually in other way.

Take a look at the example. Here I show comparison of efficiency.

require("numbers")

# generate all unique combinations 10 out of 20
allcomb <- t(combn(20,10))
# make sample of 1 million rows
set.seed(789)
df <- allcomb[sample(nrow(allcomb), 1e6, T),]
# lets sort matrix to show we have duplicates
df <- df[do.call(order, lapply(1:ncol(df), function(i) df[, i])), ]
head(df, 10)
#       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#  [1,]    1    2    3    4    5    6    7    8    9    10
#  [2,]    1    2    3    4    5    6    7    8    9    10
#  [3,]    1    2    3    4    5    6    7    8    9    10
#  [4,]    1    2    3    4    5    6    7    8    9    10
#  [5,]    1    2    3    4    5    6    7    8    9    11
#  [6,]    1    2    3    4    5    6    7    8    9    11
#  [7,]    1    2    3    4    5    6    7    8    9    11
#  [8,]    1    2    3    4    5    6    7    8    9    11
#  [9,]    1    2    3    4    5    6    7    8    9    11
# [10,]    1    2    3    4    5    6    7    8    9    11

# to be fair need to permutate numbers in rows before searching for identical sets
df <- t(apply(df, 1, function(z) z[sample(10,10)] ))
df <- as.data.frame(df)
names(df) <- letters[1:10]
# how does it look like now?
head(df, 10)
#     a b c  d  e  f  g h  i j
# 1   2 3 7  9 10  1  4 8  5 6
# 2   4 2 6  3  8 10  9 1  5 7
# 3   4 2 6  8  5  1 10 7  3 9
# 4   6 8 5  4  2  1 10 9  7 3
# 5  11 2 7  6  8  1  9 4  5 3
# 6   9 6 3 11  4  2  8 7  5 1
# 7   5 2 3 11  1  8  6 9  7 4
# 8   3 9 7  1  2  5  4 8 11 6
# 9   6 2 8  3  4  1 11 5  9 7
# 10  4 6 3  9  7  2  1 5 11 8

# now lets shuffle rows to make df more plausible
df <- df[sample(nrow(df), nrow(df)),]

Now when data.frame is ready we can test different algorithms.

system.time(indx <- !duplicated(t(apply(df, 1, sort))) )
#   user  system elapsed 
# 119.75    0.06  120.03
# doesn't impress, frankly speaking

library(sets)
system.time(indx <- !duplicated(apply(df, 1, as.set)) )
#  user  system elapsed 
# 91.60    0.00   91.89
# better, but we want faster! =)

# now lets check out the method with prime numbers
primenums <- Primes(100)[1:20]
# [1]  2  3  5  7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71
system.time({
  dfmult <- apply(as.matrix(df), 1, function(z) prod(primenums[z]) )
  my_indx <- !duplicated(dfmult) })
# user  system elapsed 
# 6.44    0.16    6.61
# not bad, isn't it? but lets compare results
identical(indx, my_indx)
# [1] TRUE

# So, if there is no difference, why wait more? ;)

There is one important assumption here - we use as.matrix(df), but what if there are not only numeric variables in our data.frame? A more unificated solution will be as follows:

system.time({
  dfmult <- apply(
    apply(df, 2, function(colmn) as.integer(factor(colmn, 
                                                   levels = unique(c(as.matrix(df)))))),
    1, function(z) prod(primenums[z]) )
  my_indx <- !duplicated(dfmult) })
#  user  system elapsed 
# 27.48    0.34   27.84
# is distinctly slower but still much faster then previous methods

And what about if we have very much columns or very much different variables? In this case instead of prod() we can use sum(log()) (which is being computed probably even faster for large numbers). Take a look at this.

pr <- Primes(5e7)
length(pr)   
# [1] 3001134
system.time(N <- sum(log(pr)))
# user  system elapsed 
# 0.12    0.00    0.13
N
# [1] 49993718

It's hard to imagine df with 3 mln columns, but here it's ok. This way allows us to carry df of any incredibly huge size with as many columns our RAM can hold.

like image 40
inscaven Avatar answered Sep 17 '22 14:09

inscaven