I'm sure it's something simple, but I have a data frame
df <- data.frame(a = c(1, 2, 3),
b = c(2, 3, 1),
c = c(3, 1, 4))
And I want a new data frame that contains the unique combinations of values in the rows, irrespective of which column they're in. So in the case above I'd want
a b c
1 2 3
3 1 4
I've tried
unique(df[c('a', 'b', 'c')])
but it sees (1, 2, 3) as unique from (2, 3, 1), which I don't want.
Maybe something like that
indx <- !duplicated(t(apply(df, 1, sort))) # finds non - duplicates in sorted rows
df[indx, ] # selects only the non - duplicates according to that index
# a b c
# 1 1 2 3
# 3 3 1 4
If your data.frame is quite big, the speed may be a matter for you. You can find duplicated sets much faster with the following idea.
Let's imaginary assign each possible value in rows a prime number and count products for each row. For example, for given df
we can accept primenums = c(2,3,5,7)
and count products c(30,30,70)
. Then duplicates in this products-vector correspond to duplicated sets in our data.frame. As multiplication is being computed much faster then any kinds of sorting, you can gain efficiency.
The code is following.
require("numbers")
primenums <- Primes(100)[1:4]
dfmult <- apply(as.matrix(df), 1, function(z) prod(primenums[z]) )
my_indx <- !duplicated(dfmult)
df[my_indx,]
Here we initialize vector primenums
with the help of function Primes
from package numbers
, but you can do manually in other way.
Take a look at the example. Here I show comparison of efficiency.
require("numbers")
# generate all unique combinations 10 out of 20
allcomb <- t(combn(20,10))
# make sample of 1 million rows
set.seed(789)
df <- allcomb[sample(nrow(allcomb), 1e6, T),]
# lets sort matrix to show we have duplicates
df <- df[do.call(order, lapply(1:ncol(df), function(i) df[, i])), ]
head(df, 10)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 1 2 3 4 5 6 7 8 9 10
# [2,] 1 2 3 4 5 6 7 8 9 10
# [3,] 1 2 3 4 5 6 7 8 9 10
# [4,] 1 2 3 4 5 6 7 8 9 10
# [5,] 1 2 3 4 5 6 7 8 9 11
# [6,] 1 2 3 4 5 6 7 8 9 11
# [7,] 1 2 3 4 5 6 7 8 9 11
# [8,] 1 2 3 4 5 6 7 8 9 11
# [9,] 1 2 3 4 5 6 7 8 9 11
# [10,] 1 2 3 4 5 6 7 8 9 11
# to be fair need to permutate numbers in rows before searching for identical sets
df <- t(apply(df, 1, function(z) z[sample(10,10)] ))
df <- as.data.frame(df)
names(df) <- letters[1:10]
# how does it look like now?
head(df, 10)
# a b c d e f g h i j
# 1 2 3 7 9 10 1 4 8 5 6
# 2 4 2 6 3 8 10 9 1 5 7
# 3 4 2 6 8 5 1 10 7 3 9
# 4 6 8 5 4 2 1 10 9 7 3
# 5 11 2 7 6 8 1 9 4 5 3
# 6 9 6 3 11 4 2 8 7 5 1
# 7 5 2 3 11 1 8 6 9 7 4
# 8 3 9 7 1 2 5 4 8 11 6
# 9 6 2 8 3 4 1 11 5 9 7
# 10 4 6 3 9 7 2 1 5 11 8
# now lets shuffle rows to make df more plausible
df <- df[sample(nrow(df), nrow(df)),]
Now when data.frame is ready we can test different algorithms.
system.time(indx <- !duplicated(t(apply(df, 1, sort))) )
# user system elapsed
# 119.75 0.06 120.03
# doesn't impress, frankly speaking
library(sets)
system.time(indx <- !duplicated(apply(df, 1, as.set)) )
# user system elapsed
# 91.60 0.00 91.89
# better, but we want faster! =)
# now lets check out the method with prime numbers
primenums <- Primes(100)[1:20]
# [1] 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71
system.time({
dfmult <- apply(as.matrix(df), 1, function(z) prod(primenums[z]) )
my_indx <- !duplicated(dfmult) })
# user system elapsed
# 6.44 0.16 6.61
# not bad, isn't it? but lets compare results
identical(indx, my_indx)
# [1] TRUE
# So, if there is no difference, why wait more? ;)
There is one important assumption here - we use as.matrix(df)
, but what if there are not only numeric variables in our data.frame
? A more unificated solution will be as follows:
system.time({
dfmult <- apply(
apply(df, 2, function(colmn) as.integer(factor(colmn,
levels = unique(c(as.matrix(df)))))),
1, function(z) prod(primenums[z]) )
my_indx <- !duplicated(dfmult) })
# user system elapsed
# 27.48 0.34 27.84
# is distinctly slower but still much faster then previous methods
And what about if we have very much columns or very much different variables? In this case instead of prod()
we can use sum(log())
(which is being computed probably even faster for large numbers). Take a look at this.
pr <- Primes(5e7)
length(pr)
# [1] 3001134
system.time(N <- sum(log(pr)))
# user system elapsed
# 0.12 0.00 0.13
N
# [1] 49993718
It's hard to imagine df
with 3 mln columns, but here it's ok. This way allows us to carry df
of any incredibly huge size with as many columns our RAM can hold.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With