
How to select all factor variables in R

Tags: r

I have a data frame named "insurance" with both numerical and factor variables. How can I select all factor variables so that I can check the levels of the categorical variables?

I tried sapply(insurance, class) to get the classes of all variables. But then I can't build a logical condition such as class(var) == "factor", because the variable names are also included in the result of sapply().

Thanks,

Gold Waterson asked Jul 28 '13 11:07




2 Answers

Some data:

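# stringsAsFactors = TRUE is needed in R >= 4.0.0, where strings are no longer
# converted to factors by default
insurance <- data.frame(
  int   = 1:5,
  fact1 = letters[1:5],
  fact2 = factor(1:5),
  fact3 = LETTERS[3:7],
  stringsAsFactors = TRUE
)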

I would use sapply like you did, but combined with is.factor to return a logical vector:

is.fact <- sapply(insurance, is.factor)
#   int fact1 fact2 fact3 
# FALSE  TRUE  TRUE  TRUE
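
As a side note on the attempt in the question: the names attached to the result of sapply() do not get in the way of a comparison, so testing the class strings directly also gives a logical vector. Here is a minimal sketch, assuming every column has a single class. An ordered factor has two, c("ordered", "factor"), which is one reason is.factor is the safer test. The data frame df and its columns are made up for illustration:

```r
# Hypothetical example data; stringsAsFactors = TRUE so strings become factors
df <- data.frame(
  num = c(1.5, 2.5, 3.5),
  grp = c("a", "b", "a"),
  stringsAsFactors = TRUE
)

# The names are carried along but do not affect the comparison
by_class <- sapply(df, class) == "factor"

# is.factor() also recognises ordered factors, whose class vector has length 2
df$size <- factor(c("S", "M", "L"), levels = c("S", "M", "L"), ordered = TRUE)
by_test <- vapply(df, is.factor, logical(1))
```

vapply is used in the second half only because it guarantees one logical value per column; sapply would work the same way here.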

Then use [ to extract these columns:

# drop = FALSE keeps a data frame even when only one column is a factor
factors.df <- insurance[, is.fact, drop = FALSE]
#   fact1 fact2 fact3
# 1     a     1     C
# 2     b     2     D
# 3     c     3     E
# 4     d     4     F
# 5     e     5     G

Finally, to get the levels, use lapply:

lapply(factors.df, levels)
# $fact1
# [1] "a" "b" "c" "d" "e"
# 
# $fact2
# [1] "1" "2" "3" "4" "5"
# 
# $fact3
# [1] "C" "D" "E" "F" "G"

You might also find str(insurance) interesting as a short summary.
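
If you need this more than once, the three steps can be wrapped in a small function. This is only a sketch: factor_levels is a hypothetical helper name, and the example data frame is made up for illustration.

```r
# Hypothetical helper: return the levels of every factor column in a data frame
factor_levels <- function(df) {
  is_fact <- vapply(df, is.factor, logical(1))
  # drop = FALSE keeps a data frame even when only one column is selected
  lapply(df[, is_fact, drop = FALSE], levels)
}

# Made-up example data with one integer column and two factor columns
example <- data.frame(
  id    = 1:3,
  color = factor(c("red", "blue", "red")),
  size  = factor(c("S", "M", "L"), levels = c("S", "M", "L"))
)

lv <- factor_levels(example)
```

Here lv is a named list with one character vector of levels per factor column; the id column is skipped.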

flodel answered Oct 05 '22 00:10


This (almost) appears to be the perfect time to use the seldom-used function rapply:

rapply(insurance, f = levels, classes = "factor", how = "list")

Or

Filter(Negate(is.null), rapply(insurance, f = levels, classes = "factor", how = "list"))

To remove the NULL elements (that weren't factors)

Or simply

lapply(Filter(is.factor, insurance), levels)
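
To see why the Negate(is.null) step matters, here is a small sketch comparing the variants on a made-up two-column data frame (example and its columns are assumptions for illustration). With how = "list", rapply keeps one entry per column and fills the non-factor ones with the default value, NULL:

```r
# Made-up example data: one integer column, one factor column
example <- data.frame(
  n = 1:3,
  f = factor(c("x", "y", "x"))
)

# how = "list" keeps an entry for every column; non-factors get deflt (NULL)
with_nulls <- rapply(example, f = levels, classes = "factor", how = "list")

# Dropping the NULL entries leaves only the factor columns
cleaned <- Filter(Negate(is.null), with_nulls)

# The Filter/lapply version never creates the NULL entries in the first place
filtered <- lapply(Filter(is.factor, example), levels)
```

Both cleaned and filtered should end up holding just the levels of f, which is why the last variant is the simplest of the three.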

mnel answered Oct 05 '22 00:10