Anyone know if the incomparables argument of unique() or duplicated() has ever been implemented beyond incomparables=FALSE?
Maybe I don't understand how it is supposed to work...
Anyway, I'm looking for a slick way to keep only unique columns (or rows), where a column counts as a duplicate if it is identical to another column apart from extra NAs. I can brute force it using cor(), for example, but for tens of thousands of columns this is intractable.
Here's an example; sorry if it's a little messy, but I think it illustrates the point. Make some matrix z:
z <- matrix(sample(c(1:3, NA), 100, replace=TRUE), 10, 10)
colnames(z) <- paste("c", 1:10, sep="")
rownames(z) <- paste("r",1:10, sep="")
Let's add a couple of duplicate columns with extra NAs, and randomize the column order (that way they aren't always at the end).
c3.1 <- z[, 3]
c3.1[sample(1:10, 3)] <- NA
c8.1 <- z[, 8]
c8.1[sample(1:10, 5)] <- NA
z <- cbind(z, c3.1, c8.1)
z <- z[, sample(1:ncol(z))]
So I could sort by the number of missing values, and then it would seem as though duplicated() or unique() would work, but they don't ignore missing values.
missing <- apply(z, 2, function(x) {length(which(is.na(x)))})
z.sorted <- z[, order(missing)]
z.sorted[,!duplicated(z.sorted,MARGIN=2)]
unique(z.sorted,MARGIN=2)
I figured this is what the incomparables argument was specifically for, but it doesn't appear to be implemented yet:
z.sorted[,!duplicated(z.sorted,MARGIN=2,incomparables=NA)]
unique(z.sorted,MARGIN=2,incomparables=NA)
I know I will likely find a less elegant solution soon enough; I guess I'm really asking why this hasn't been implemented yet, or whether I'm just using it wrong. I seem to run into this quite often, yet I searched around for quite a while without finding an answer. Any thoughts?
As you suspect, for the data.frame and matrix methods of unique, incomparables != FALSE is not yet implemented. It is implemented in the default method, which is used for vectors without dims. E.g.:
unique(c(1, 2, 2, 3, 3, 3, NA, NA, NA), incomparables=2)
## [1] 1 2 2 3 NA
unique(c(1, 2, 2, 3, 3, 3, NA, NA, NA), incomparables=NA)
## [1] 1 2 3 NA NA NA
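The same holds for duplicated() on a plain vector. A small sketch using the vector from above (the variable name x is mine, just for illustration): with incomparables=NA, an NA is never flagged as a duplicate of an earlier NA.

```r
x <- c(1, 2, 2, 3, 3, 3, NA, NA, NA)

# Default behaviour: the second and third NA count as duplicates of the first.
duplicated(x)

# With incomparables = NA, NA values are never matched against each other,
# so none of them is flagged.
duplicated(x, incomparables = NA)

# Dropping the flagged elements gives the same result as
# unique(x, incomparables = NA).
x[!duplicated(x, incomparables = NA)]
```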
Take a look at the source of unique.matrix versus unique.default (just type the function names into the console and hit Enter, or press F2 in RStudio to open the source in a new pane).
In your case, you could use outer to create a matrix indicating whether particular pairs of rows/columns are the same or not, disregarding NAs.
same <- outer(seq_len(ncol(z)), seq_len(ncol(z)),
Vectorize(function(x, y) all(z[, x]==z[, y], na.rm=TRUE)))
same
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [4,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [7,] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [10,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Then, if you want to keep only those columns that are the same as, e.g., the second column (which is column c8.1 for me - see bottom of this post for the full z matrix I used), you can do:
z[, same[2, ]] # or, equivalently, z[, same[, 2]]
## c8.1 c8
## r1 2 2
## r2 1 1
## r3 NA 3
## r4 NA 1
## r5 3 3
## r6 NA 1
## r7 2 2
## r8 NA 1
## r9 3 3
## r10 NA 1
To reduce the matrix to the set of unique columns (ignoring NAs), keeping from each group of matching columns the one with the fewest NAs, you can then do:
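# lapply always returns a list, whereas apply() can collapse to a matrix
# when every column has the same number of matches.
groups <- lapply(seq_len(ncol(same)), function(j) which(same[, j]))
z[, unique(sapply(groups, function(x) x[which.min(colSums(is.na(z))[x])]))]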
## c7 c8 c3 c1 c6 c10 c2 c9 c4
## r1 2 2 1 2 1 1 1 2 NA
## r2 3 1 3 1 3 NA 1 2 2
## r3 2 3 2 3 1 NA 2 1 NA
## r4 2 1 1 2 2 1 3 NA 2
## r5 NA 3 2 1 3 2 NA NA 3
## r6 2 1 2 2 1 1 2 1 NA
## r7 2 2 2 2 NA 3 1 2 2
## r8 NA 1 1 3 2 NA 1 NA 1
## r9 1 3 3 2 NA 2 1 NA 2
## r10 NA 1 1 NA 1 1 1 2 3
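With tens of thousands of columns, calling an R function once per pair through outer and Vectorize will probably be slow. Below is a rough sketch of one possible alternative, assuming the matrix only holds a small set of discrete values (as z does). The helper name same_ignoring_na is hypothetical, not something from base R. It counts, for each pair of columns, the rows where both are observed and the rows where both are observed and equal, using crossprod; two columns match if those counts are the same.

```r
# Hypothetical helper (not part of base R): NA-tolerant column equality.
# Assumes m holds a small number of distinct non-NA values.
same_ignoring_na <- function(m) {
  present <- !is.na(m)              # TRUE where a value is observed
  both <- crossprod(present + 0)    # [i, j]: rows observed in both columns
  agree <- matrix(0, ncol(m), ncol(m))
  for (v in unique(m[present])) {
    hit <- present & m == v         # TRUE where the value equals v (never NA)
    agree <- agree + crossprod(hit + 0)  # rows where both columns equal v
  }
  agree == both                     # equal wherever both are observed
}

# Tiny illustration: b is a with one extra NA, c differs from a in row 3.
m <- cbind(a = c(1, 2, 3), b = c(1, NA, 3), c = c(1, 2, 2))
res <- same_ignoring_na(m)
res
```

On this small m the result agrees with the outer version above. Note that the result is still an ncol-by-ncol matrix, so memory grows quadratically with the number of columns; for very wide data you would likely need to process the columns in blocks.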
For reference, here is the z I was working with:
c7 c8.1 c3 c1 c5 c10 c8 c6 c2 c3.1 c9 c4
r1 2 2 1 2 1 1 2 1 1 1 2 NA
r2 3 1 3 1 3 NA 1 3 1 3 2 2
r3 2 NA 2 3 1 NA 3 1 2 2 1 NA
r4 2 NA 1 2 NA 1 1 2 3 NA NA 2
r5 NA 3 2 1 3 2 3 3 NA 2 NA 3
r6 2 NA 2 2 1 1 1 1 2 2 1 NA
r7 2 2 2 2 1 3 2 NA 1 2 2 2
r8 NA NA 1 3 NA NA 1 2 1 NA NA 1
r9 1 3 3 2 1 2 3 NA 1 NA NA 2
r10 NA NA 1 NA NA 1 1 1 1 1 2 3