
R: factor levels, recode rest to 'other'

Tags:

r

r-factor

I use factors somewhat infrequently and generally find them comprehensible, but I am often fuzzy about the details of specific operations. Currently, I am collapsing categories with few observations into "other" and am looking for a quick way to do that. I have perhaps 20 levels of a variable, but am interested in collapsing a bunch of them into one.

# 500 rows; naics is drawn with replacement from 21 six-digit codes
data <- data.frame(employees = sample.int(1000, 500),
                   naics = sample(c('621111','621112','621210','621310','621320','621330','621340',
                                    '621391','621399','621410','621420','621491','621492','621493',
                                    '621498','621511','621512','621610','621910','621991','621999'),
                                  500, replace = TRUE))

Here are my levels of interest, and their labels in separate vectors.

# levels and labels
top8 <- c('621111','621210','621399','621610','621330',
          '621310','621511','621420','621320')
top8_desc <- c('Offices of physicians',
               'Offices of dentists',
               'Offices of all other miscellaneous health practitioners',
               'Home health care services',
               'Offices of Mental Health Practitioners',
               'Offices of chiropractors',
               'Medical Laboratories',
               'Outpatient Mental Health and Substance Abuse Centers',
               'Offices of optometrists')

I could call factor() and enumerate every level, assigning "other" to each category that has few observations.

Assuming that top8 and top8_desc above are the actual top 8, what is the best way to declare data$naics as a factor variable so that the values in top8 are correctly coded and everything else is recoded as other?

ako asked Mar 20 '13 20:03



1 Answer

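I think the easiest way is to relabel every naics value that is not in the top 8 with a special value.

# convert to character first so the assignment also works if naics is already a factor
data$naics <- as.character(data$naics)
data$naics[!(data$naics %in% top8)] <- "-99"

Then you can use the "exclude" argument when turning it into a factor. Note that the excluded values become NA rather than a level named "other".

data$naics <- factor(data$naics, exclude = "-99")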
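
The exclude approach turns the remaining codes into NA. If an explicit "other" level with the descriptive labels is wanted instead, something along these lines may work. This is only a sketch: collapse_to_other is a hypothetical helper name, not part of base R or of the original post, and the toy vector x below stands in for data$naics.

```r
# Hypothetical helper (not from the original post): keep the codes in `keep`,
# collapse every other non-missing value into a single `other` level.
collapse_to_other <- function(x, keep, labels = keep, other = "Other") {
  stopifnot(length(keep) == length(labels))
  x <- as.character(x)                      # also handles x that is already a factor
  x[!is.na(x) & !(x %in% keep)] <- other    # real NAs stay NA
  # Declare the levels explicitly so `other` always sorts last.
  factor(x, levels = c(keep, other), labels = c(labels, other))
}

# Toy data: two codes of interest, two codes that should be collapsed.
x <- c("621111", "621112", "621210", "621999", "621111")
f <- collapse_to_other(x,
                       keep   = c("621111", "621210"),
                       labels = c("Offices of physicians", "Offices of dentists"))

levels(f)
table(f)
```

With the objects from the question, the call would presumably be data$naics <- collapse_to_other(data$naics, top8, top8_desc). Passing the levels explicitly, rather than letting factor() sort them, keeps the codes in the order given in top8 and puts "Other" at the end.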
kith answered Nov 22 '22 06:11