
Set ordering of factor levels for multiple columns in a data frame

Tags:

r

I've loaded data from a CSV file into a data frame. Each column represents a survey question, and all of the answers are on a five-point Likert scale, with the labels: ("None", "Low", "Medium", "High", "Very High").

When I read in the data initially, R correctly interprets those values as factors but doesn't know what the ordering should be. I want to specify what the ordering is for the values so I can do some numerical calculations. I thought the following code would work:

X <- read.csv('..')
likerts <- data.frame(apply(X, 2, function(X){factor(X, 
             levels = c("None", "Low", "Medium", "High", "Very High"), 
             ordered = T)}))

What happens instead is that all of the level data gets converted into strings. How do I do this correctly?

Lorin Hochstein asked Dec 12 '22 16:12



2 Answers

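The problem starts with apply(): it coerces the data frame to a character matrix and returns a character matrix, so the ordered factors never survive. data.frame() then converts those strings back into plain, unordered factors (or leaves them as strings when stringsAsFactors = FALSE, which is the default from R 4.0.0). Use lapply() with as.data.frame() instead. A trivial example with a toy data frame: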

X <- data.frame(
  var1 = rep(letters[1:5], 3),
  var2 = rep(letters[1:5], each = 3)
)
likerts <- as.data.frame(lapply(X, function(col) {
  ordered(col, levels = letters[5:1])
}))

> str(likerts)
'data.frame':   15 obs. of  2 variables:
 $ var1: Ord.factor w/ 5 levels "e"<"d"<"c"<"b"<..: 5 4 3 2 1 5 4 3 2 1 ...
 $ var2: Ord.factor w/ 5 levels "e"<"d"<"c"<"b"<..: 5 5 5 4 4 4 3 3 3 2 ...

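As a side note, ordered(x, ...) is shorthand for factor(x, ..., ordered = TRUE), and lapply(X, ...) is a better fit than apply(X, 2, ...) for data frames, since it works on each column directly instead of first coercing the whole data frame to a matrix.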
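Applied to the Likert labels from the question, the same approach might look like the sketch below. The data frame survey and its columns q1 and q2 are made up here to stand in for the CSV, and stringsAsFactors = TRUE is set only to mimic the factor columns described in the question.

```r
# Level order taken from the question
likert_levels <- c("None", "Low", "Medium", "High", "Very High")

# Hypothetical survey data standing in for read.csv(); q1 and q2 are assumed names
survey <- data.frame(
  q1 = c("Low", "High", "None", "Very High"),
  q2 = c("None", "Medium", "Low", "High"),
  stringsAsFactors = TRUE
)

# apply() goes through a character matrix, so the result is plain strings
via_apply <- apply(survey, 2, function(col) ordered(col, levels = likert_levels))
is.character(via_apply)  # TRUE

# lapply() hands over each column as its own vector, so ordered factors survive
likerts <- as.data.frame(lapply(survey, ordered, levels = likert_levels))

# Integer codes follow the level order: None = 1, ..., Very High = 5
scores <- as.data.frame(lapply(likerts, as.integer))
colMeans(scores)  # q1 = 3.0, q2 = 2.5
```

Whether treating the codes 1 to 5 as numbers is appropriate depends on the analysis; the codes only reflect the order given in likert_levels.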

Joris Meys answered Jan 25 '23 23:01



And the obligatory plyr solution (using Joris's example above):

> require(plyr)
> Y <- catcolwise( function(v) ordered(v, levels = letters[5:1]))(X)

> str(Y)
'data.frame':   15 obs. of  2 variables:
 $ var1: Ord.factor w/ 5 levels "e"<"d"<"c"<"b"<..: 5 4 3 2 1 5 4 3 2 1 ...
 $ var2: Ord.factor w/ 5 levels "e"<"d"<"c"<"b"<..: 5 5 5 4 4 4 3 3 3 2 ...

One good thing about catcolwise is that it applies the function only to the categorical columns of X (factors, and character columns as well), leaving the others alone. To explain what is going on: catcolwise takes a function as an argument and returns a new function that operates column-wise on those columns of a data frame. So the line above can be read in two stages: fn <- catcolwise(...); Y <- fn(X).

There are also colwise (operates on all columns) and numcolwise (operates only on numeric columns).
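For comparison, the "only touch some columns" idea can also be sketched in base R without plyr. This is not how catcolwise is implemented; it is a rough analogue that selects factor columns only, and the data frame survey with its id, q1 and q2 columns is invented for illustration.

```r
likert_levels <- c("None", "Low", "Medium", "High", "Very High")

# Hypothetical mixed data: a numeric id column plus two Likert columns
survey <- data.frame(
  id = 1:3,
  q1 = c("Low", "High", "None"),
  q2 = c("Medium", "Very High", "Low"),
  stringsAsFactors = TRUE
)

# Flag the factor columns, then overwrite only those with ordered factors
is_cat <- vapply(survey, is.factor, logical(1))
survey[is_cat] <- lapply(survey[is_cat], ordered, levels = likert_levels)

# id stays an integer column; q1 and q2 are now ordered factors
str(survey)
```

Once the columns are ordered, comparisons such as survey$q1 >= "Medium" follow the Likert order instead of raising a warning.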

Prasad Chalasani answered Jan 25 '23 23:01
