
How do I quickly find out whether two (large) factors are relabelings of each other?

Tags: r, r-factor

I have two vectors of factors and suspect that they carry the same information up to relabeling. How can I find out whether this is correct?

My problem is that both vectors are pretty long (200,000 entries), with a large number of levels (4,000). Some levels are very frequent, but there is a "long tail" of levels that only occur once each.

Here is a reproducible example (sorry, I couldn't find a way to make it more compact and still show the properties of my data):

foo <- structure(c(3213L, 428L, 104L, 59L, 23L, 17L, 15L, 9L, 5L, 6L, 
1L, 5L, 3L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Dim = 69L, .Dimnames = structure(list(
    c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", 
    "12", "13", "14", "15", "16", "23", "33", "83", "205", "246", 
    "255", "319", "374", "379", "389", "552", "566", "595", "686", 
    "750", "846", "965", "999", "1006", "1254", "1514", "1535", 
    "1605", "1687", "1744", "1792", "1937", "1946", "2166", "2198", 
    "2206", "2420", "2503", "2736", "2965", "2986", "3036", "3273", 
    "3734", "4026", "4073", "4279", "5038", "5040", "5185", "5607", 
    "6298", "6609", "6930", "15392", "21083", "22933", "29357"
    )), .Names = ""), class = "table")
bar <- as.numeric(rep(names(foo),times=foo))
factor.1 <- as.factor(rep(paste0("a",sprintf("%04i",1:length(bar))),times=bar))
set.seed(1)
factor.2 <- as.factor(sample(gsub("a","b",unique(factor.1)),length(unique(factor.1)))[
  as.numeric(factor.1)])

After this exercise, factor.1 and factor.2 are simply relabelings of each other. So, how can we find out whether this holds for new vectors?

Things that don't work:

  1. The internal integer coding does not need to be the same, so simply checking whether cor(as.numeric(factor.1),as.numeric(factor.2))==1 will not work.

  2. I tried checking whether each level of factor.1 corresponds to exactly one level of factor.2 and vice versa. Unfortunately, this takes far too long, on the order of hours:

    foo <- by(factor.1,factor.2,FUN=function(zz)length(unique(zz)))
    bar <- by(factor.2,factor.1,FUN=function(zz)length(unique(zz)))
    all(foo == 1) && all(bar == 1)
    
  3. If we can perfectly fit factor.1 in a multinomial model using factor.2 as a predictor and vice versa, both carry the same information. Unfortunately, nnet::multinom(factor.1~factor.2) yields the dreaded "cannot allocate a vector of size XX" error. randomForest::randomForest(), which would at least give us a probabilistic answer, cannot handle factors with more than 53 levels.

  4. We could run table(factor.1,factor.2) and check whether every row and every column has exactly one non-zero entry, but this again runs out of memory.
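
To illustrate point 1, here is a minimal sketch with toy data (the vectors x and y are made up for this illustration and are not part of the data above). Two factors can be relabelings of each other while their integer codes are far from perfectly correlated, because the codes follow the sort order of the labels:

```r
# Toy data (hypothetical): y is x with a -> z, b -> x, c -> y
x <- factor(c("a", "b", "c", "a"))
y <- factor(c("z", "x", "y", "z"))

# Integer codes follow the sorted level order, so they differ
codes_x <- as.integer(x)  # 1 2 3 1
codes_y <- as.integer(y)  # 3 1 2 3

# The correlation of the codes is not 1 although y is a relabeling of x
r <- cor(codes_x, codes_y)
r == 1  # FALSE
```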

Stephan Kolassa asked Sep 08 '15 16:09


1 Answer

The first function, cnt, counts the number of unique elements of its argument. The second, all_one, returns TRUE if exactly one distinct value of x occurs within each level of y. If that holds for factor.1 and factor.2, and both have the same number of distinct values, then one is a relabeling of the other. With the given data it returns immediately, so it seems pretty fast. Solution 2 is a faster version of one of your ideas (checking the correspondence in both directions). Use either one.

cnt <- function(x) length(unique(x))
all_one <- function(x, y) all(tapply(unclass(x), y, cnt) == 1)

# solution 1
all_one(factor.1, factor.2) && cnt(factor.1) == cnt(factor.2)

# solution 2
all_one(factor.1, factor.2) && all_one(factor.2, factor.1)
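
As a further sketch, not part of the code above: the same check can be phrased in terms of the distinct (x, y) pairs that actually occur. If no x code appears in two different pairs and no y code does either, the two vectors are relabelings of each other. The helper name is_relabeling and the toy vectors below are assumptions made for this illustration, and it has not been timed on the full data.

```r
# Hypothetical helper: TRUE if x and y are relabelings of each other
is_relabeling <- function(x, y) {
  stopifnot(length(x) == length(y))
  # Keep every (x, y) combination that actually occurs exactly once;
  # factor() drops unused levels before taking the integer codes
  pairs <- unique(data.frame(x = as.integer(factor(x)),
                             y = as.integer(factor(y))))
  # Relabeling: no x code is paired with two y codes, and vice versa
  !anyDuplicated(pairs$x) && !anyDuplicated(pairs$y)
}

# Toy data (hypothetical)
x     <- factor(c("a", "b", "a", "c"))
y_ok  <- factor(c("q", "p", "q", "r"))  # a -> q, b -> p, c -> r
y_bad <- factor(c("q", "p", "p", "r"))  # "a" maps to both q and p

is_relabeling(x, y_ok)   # TRUE
is_relabeling(x, y_bad)  # FALSE
```

Because only the observed pairs are stored, this avoids building the full levels-by-levels table that ran out of memory in the question.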
G. Grothendieck answered Nov 15 '22 09:11