Combining low frequency counts

I'm trying to collapse a nominal categorical vector by combining low-frequency categories into an 'Other' category.

The data (column of a dataframe) looks like this, and contains information for all 50 states:

California
Florida
Alabama
...

table(colname)/length(colname) correctly returns the relative frequencies. What I'm trying to do is lump anything below a given threshold (say f = 0.02) together. What is the correct approach?

F. R asked Dec 20 '15 18:12


4 Answers

From the sounds of it, something like the following should work for you:

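condenseMe <- function(vector, threshold = 0.02, newName = "Other") {
  # Names of the categories whose relative frequency is below the threshold
  toCondense <- names(which(prop.table(table(vector)) < threshold))
  # Overwrite every value in one of those categories with the new label
  vector[vector %in% toCondense] <- newName
  vector
}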

Try it out:

## Sample data
set.seed(1)
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))

round(prop.table(table(a)), 2)
# a
#    a    A    b    B    c    C    d    D    e    E    f    g    h 
# 0.07 0.02 0.07 0.02 0.10 0.02 0.10 0.02 0.12 0.02 0.07 0.12 0.13 
#    i    j 
# 0.08 0.07 

a
#  [1] "c" "d" "d" "e" "j" "h" "c" "h" "g" "i" "g" "d" "f" "D" "g" "h"
# [17] "h" "a" "b" "h" "e" "g" "h" "b" "d" "e" "e" "g" "i" "f" "d" "e"
# [33] "g" "c" "g" "a" "B" "i" "i" "b" "i" "j" "f" "d" "c" "h" "E" "j"
# [49] "j" "c" "C" "e" "f" "a" "a" "h" "e" "c" "A" "b"

condenseMe(a)
#  [1] "c"     "d"     "d"     "e"     "j"     "h"     "c"     "h"    
#  [9] "g"     "i"     "g"     "d"     "f"     "Other" "g"     "h"    
# [17] "h"     "a"     "b"     "h"     "e"     "g"     "h"     "b"    
# [25] "d"     "e"     "e"     "g"     "i"     "f"     "d"     "e"    
# [33] "g"     "c"     "g"     "a"     "Other" "i"     "i"     "b"    
# [41] "i"     "j"     "f"     "d"     "c"     "h"     "Other" "j"    
# [49] "j"     "c"     "Other" "e"     "f"     "a"     "a"     "h"    
# [57] "e"     "c"     "Other" "b"   

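Note, however, that if you are dealing with factors, you should convert them with as.character first: assigning a value that is not an existing level to a factor produces NA (with a warning) instead of the new label.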
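
A minimal sketch of that conversion, assuming the column is stored as a factor. The sample factor f is made up for illustration, and condenseMe is repeated from above only so the snippet runs on its own:

```r
# condenseMe as defined earlier in this answer
condenseMe <- function(vector, threshold = 0.02, newName = "Other") {
  toCondense <- names(which(prop.table(table(vector)) < threshold))
  vector[vector %in% toCondense] <- newName
  vector
}

# Hypothetical factor: "z" makes up 1% of the 100 values
f <- factor(rep(c("x", "y", "z"), times = c(60, 39, 1)))

# Convert to character, condense, then turn the result back into a factor
lumped <- factor(condenseMe(as.character(f)))
table(lumped)
```

Calling condenseMe(f) on the factor directly would instead warn about an invalid factor level and leave NA where "Other" was intended.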

A5C1D2H2I1M1N2O1R2T1 answered Nov 20 '22 14:11


Hadley Wickham's forcats package (available on CRAN since 2016-08-29) has a handy function fct_lump() which lumps together levels of a factor according to different criteria.

OP's requirement to lump together factor levels below a threshold of 0.02 can be achieved with:

set.seed(1)
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))
forcats::fct_lump(a, prop = 0.02)
 [1] c     d     d     e     j     h     c     h     g     i     g     d    
[13] f     Other g     h     h     a     b     h     e     g     h     b    
[25] d     e     e     g     i     f     d     e     g     c     g     a    
[37] Other i     i     b     i     j     f     d     c     h     Other j    
[49] j     c     Other e     f     a     a     h     e     c     Other b    
Levels: a b c d e f g h i j Other

Note that the sample data from A5C1D2H2I1M1N2O1R2T1's answer has been used for comparison.


The function offers even more possibilities, e.g., it can keep the 5 factor levels with the lowest frequencies and lump together the other levels:

forcats::fct_lump(a, n = -5)
 [1] Other Other Other Other Other Other Other Other Other Other Other Other
[13] Other D     Other Other Other Other Other Other Other Other Other Other
[25] Other Other Other Other Other Other Other Other Other Other Other Other
[37] B     Other Other Other Other Other Other Other Other Other E     Other
[49] Other Other C     Other Other Other Other Other Other Other A     Other
Levels: A B C D E Other
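
More recent forcats releases (0.5.0 and later, as far as I can tell) also provide single-purpose variants such as fct_lump_prop() and fct_lump_min(), which the package documentation recommends over the general fct_lump(). A short sketch with made-up data (the states vector is hypothetical and not the OP's data):

```r
library(forcats)

# Hypothetical sample data: five categories with uneven counts (100 values)
states <- rep(c("CA", "TX", "AK", "WY", "VT"), times = c(50, 44, 4, 1, 1))

# Lump every level that accounts for no more than 2% of the values
lumped <- fct_lump_prop(states, prop = 0.02)
table(lumped)

# Alternatively, keep only the levels that occur at least 4 times
lumped_min <- fct_lump_min(states, min = 4)
table(lumped_min)
```

In both calls only "WY" and "VT" end up in "Other". Note that, as I understand it, fct_lump_prop() leaves the factor unchanged when only a single level falls below the threshold, since lumping one level on its own would gain nothing.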

Uwe answered Nov 20 '22 14:11


A little late to the game, but you may use my package DataExplorer. The group_category function is exactly what you are looking for. There are other options too; type ?group_category to find out more.

For example, in your case:

library(DataExplorer)
group_category(data, "colname", 0.02, update = TRUE)

Here are more examples.

Boxuan answered Nov 20 '22 13:11


I used an updated version of the condenseMe function that works on factors directly:

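condenseMe <- function(vector, name, limit) {
  # Levels whose relative frequency is below the limit
  toCondense <- names(which(prop.table(table(vector)) < limit))
  # Rename those levels; levels sharing the new name are merged
  levels(vector)[levels(vector) %in% toCondense] <- name
  vector
}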

Note: if NA is itself one of the factor's levels, this condenseMe function can in some cases turn the NA level into NA missing values. That's what happened to me.
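
One possible way around the NA issue described in the note, sketched under some assumptions: the helper name condenseFactor is hypothetical, frequencies are computed over the non-missing values only, and an NA level is put back at the end with addNA() if the input had one. This is a sketch, not a drop-in replacement for every case:

```r
condenseFactor <- function(f, limit = 0.02, name = "Other") {
  stopifnot(is.factor(f))
  hadNALevel <- anyNA(levels(f))      # remember whether NA was a level
  x <- as.character(f)                # an NA level becomes NA_character_
  freq <- prop.table(table(x))        # table() ignores NA by default
  rare <- names(freq)[freq < limit]
  x[x %in% rare] <- name              # NA entries are never matched here
  keep <- setdiff(levels(f), c(rare, NA))
  out <- factor(x, levels = unique(c(keep, if (length(rare) > 0) name)))
  # Restore NA as an explicit level only if the input had it
  if (hadNALevel) addNA(out) else out
}

# Hypothetical data: "c" is rare, and NA is stored as a level
f <- addNA(factor(c(rep("a", 60), rep("b", 37), "c", NA, NA)))
out <- condenseFactor(f)
table(out, useNA = "ifany")
```

Going through as.character() and rebuilding the factor avoids assigning to levels() while NA is among them, which appears to be the step that drops the NA level.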

Iryna Pazharytskaya answered Nov 20 '22 13:11