I'm trying to collapse a nominal categorical vector by combining low-frequency levels into an 'Other' category. The data (a column of a data frame) contains entries for all 50 states and looks like this:
California
Florida
Alabama
...
table(colname)/length(colname)
correctly returns the relative frequencies. What I'm trying to do is lump together anything with a frequency below a given threshold (say f = 0.02). What is the correct approach?
From the sound of it, something like the following should work for you:
condenseMe <- function(vector, threshold = 0.02, newName = "Other") {
  ## Values whose relative frequency falls below the threshold
  toCondense <- names(which(prop.table(table(vector)) < threshold))
  vector[vector %in% toCondense] <- newName
  vector
}
Try it out:
## Sample data
set.seed(1)
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))
round(prop.table(table(a)), 2)
# a
# a A b B c C d D e E f g h
# 0.07 0.02 0.07 0.02 0.10 0.02 0.10 0.02 0.12 0.02 0.07 0.12 0.13
# i j
# 0.08 0.07
a
# [1] "c" "d" "d" "e" "j" "h" "c" "h" "g" "i" "g" "d" "f" "D" "g" "h"
# [17] "h" "a" "b" "h" "e" "g" "h" "b" "d" "e" "e" "g" "i" "f" "d" "e"
# [33] "g" "c" "g" "a" "B" "i" "i" "b" "i" "j" "f" "d" "c" "h" "E" "j"
# [49] "j" "c" "C" "e" "f" "a" "a" "h" "e" "c" "A" "b"
condenseMe(a)
# [1] "c" "d" "d" "e" "j" "h" "c" "h"
# [9] "g" "i" "g" "d" "f" "Other" "g" "h"
# [17] "h" "a" "b" "h" "e" "g" "h" "b"
# [25] "d" "e" "e" "g" "i" "f" "d" "e"
# [33] "g" "c" "g" "a" "Other" "i" "i" "b"
# [41] "i" "j" "f" "d" "c" "h" "Other" "j"
# [49] "j" "c" "Other" "e" "f" "a" "a" "h"
# [57] "e" "c" "Other" "b"
Note, however, that if you are dealing with factors, you should convert them with as.character() first.
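To illustrate why (a minimal sketch reusing condenseMe() from above; the factor f is made up for illustration): assigning "Other" directly into a factor that lacks that level produces NAs with an "invalid factor level" warning, whereas converting to character first works as expected.

```r
## condenseMe() as defined in the answer above
condenseMe <- function(vector, threshold = 0.02, newName = "Other") {
  toCondense <- names(which(prop.table(table(vector)) < threshold))
  vector[vector %in% toCondense] <- newName
  vector
}

## Hypothetical factor: "b" occurs once out of 61, i.e. below 0.05
f <- factor(c(rep("a", 60), "b"))

## Convert to character before condensing
x <- condenseMe(as.character(f), threshold = 0.05)
unique(x)
# [1] "a"     "Other"
```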
Hadley Wickham's forcats package (available on CRAN since 2016-08-29) has a handy function fct_lump(), which lumps together levels of a factor according to different criteria.
The OP's requirement to lump together levels below a threshold of 0.02 can be achieved by:
set.seed(1)
a <- sample(c("A", "B", "C", "D", "E", sample(letters[1:10], 55, TRUE)))
forcats::fct_lump(a, prop = 0.02)
 [1] c     d     d     e     j     h     c     h     g     i     g     d
[13] f     Other g     h     h     a     b     h     e     g     h     b
[25] d     e     e     g     i     f     d     e     g     c     g     a
[37] Other i     i     b     i     j     f     d     c     h     Other j
[49] j     c     Other e     f     a     a     h     e     c     Other b
Levels: a b c d e f g h i j Other
Note that the sample data from this answer has been used for comparison.
The function offers even more possibilities, e.g., it can keep the 5 factor levels with the lowest frequencies and lump together the other levels:
forcats::fct_lump(a, n = -5)
 [1] Other Other Other Other Other Other Other Other Other Other Other Other
[13] Other D     Other Other Other Other Other Other Other Other Other Other
[25] Other Other Other Other Other Other Other Other Other Other Other Other
[37] B     Other Other Other Other Other Other Other Other Other E     Other
[49] Other Other C     Other Other Other Other Other Other Other A     Other
Levels: A B C D E Other
A little late to the game, but you may use my package DataExplorer. The group_category function is exactly what you are looking for. There are other options too; you can type ?group_category to find out more.
For example, in your case:
library(DataExplorer)
group_category(data, "colname", 0.02, update = TRUE)
I used an updated version of the condenseMe function that works on factors directly:
condenseMe <- function(vector, name, limit) {
  toCondense <- names(which(prop.table(table(vector)) < limit))
  ## Replace the matching levels in place, so the input stays a factor
  levels(vector)[levels(vector) %in% toCondense] <- name
  vector
}
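A quick illustration (with a made-up factor whose level "c" falls below the 10% limit; the fence repeats the function so it runs standalone):

```r
## The updated, factor-aware condenseMe() from above
condenseMe <- function(vector, name, limit) {
  toCondense <- names(which(prop.table(table(vector)) < limit))
  levels(vector)[levels(vector) %in% toCondense] <- name
  vector
}

## "c" has relative frequency 1/21 (~0.048), below the 0.1 limit
f <- factor(c(rep("a", 10), rep("b", 10), "c"))
g <- condenseMe(f, name = "Other", limit = 0.1)
levels(g)
# [1] "a"     "b"     "Other"
```

Because the replacement happens on the levels rather than the values, the result is still a factor and no NAs are introduced.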
Note: if NA is itself set as one of the levels, in some cases the condenseMe function will replace the NA level with genuine missing values (NA). That's what happened to me.